Knowledge and Spatial Pyramid Distance-Based Gated Graph Attention Network for Remote Sensing Semantic Segmentation

Abstract: Pixel-based semantic segmentation methods take pixels as recognition units and are restricted by the limited range of their receptive fields, so they cannot carry richer and higher-level semantics. This reduces the accuracy of remote sensing (RS) semantic segmentation to a certain extent. Compared with pixel-based methods, graph neural networks (GNNs) usually use objects as input nodes, so they not only have relatively low computational complexity but can also carry richer semantic information. However, traditional GNNs rely mostly on the context information of individual samples and lack geographic prior knowledge that reflects the overall situation of the research area. Therefore, these methods may be disturbed by the confusion of “different objects with the same spectrum” or of “violating the first law of geography” in some areas. To address these problems, we propose a remote sensing semantic segmentation model called the knowledge and spatial pyramid distance-based gated graph attention network (KSPGAT), which is based on prior knowledge, spatial pyramid distance, and a graph attention network (GAT) with a gating mechanism. The model first uses superpixels (geographical objects) to form the nodes of a graph neural network and then uses a novel spatial pyramid distance recognition algorithm to recognize the spatial relationships. Finally, based on the integration of feature similarity and the spatial relationships of geographic objects, a multi-source attention mechanism and a gating mechanism are designed to control the process of node aggregation. As a result, high-level semantics, spatial relationships, and prior knowledge can be introduced into a remote sensing semantic segmentation network. The experimental results show that our model improves the overall accuracy by 4.43% compared with the U-Net network and by 3.80% compared with the baseline GAT network.


Introduction
Currently, pixel-based methods [1][2][3][4][5] tend to take pixels as recognition units to achieve remote sensing semantic segmentation and fuse the information of the area covered by the convolution kernel through the convolution operation. However, these methods cannot use high-level semantics, such as spatial relations. Besides, the receptive fields in convolution are limited (generally 3 × 3) [6] and unevenly distributed [7], so it is hard to integrate context information from a larger area or to obtain global information more evenly.
In response to the above problems, different scholars have carried out fruitful research [8,9]. Although the non-local method [8] can use the context information of the sample more effectively, it is computationally expensive and lacks high-level semantics, such as spatial relations.
With the development of graph neural networks, GNN-based semantic segmentation has attracted increasing attention [10][11][12]. Compared with pixel-based semantic segmentation methods, current GNNs usually take objects as input nodes, so their computational complexity is relatively small. Besides, they are free from salt-and-pepper effects and can carry richer semantic information. However, the following two problems urgently need to be solved before they can be better applied to remote sensing recognition:
(1) Confusion of "different objects with the same spectrum". The phenomenon of "different objects with the same spectrum" is a common problem in remote sensing analysis. Correctly identifying the objects disturbed by this problem often requires knowing their surrounding objects. If only the similarity of the spectral features is considered, as in the similarity-based adjacency matrix of the graph attention network (GAT), the network will be vulnerable to this problem, resulting in the misclassification of the central node. Figure 1 shows two types of this problem: the flat_field and the town; the city_grass and the flat_field.
As shown in Figure 1, without the surrounding environment, the spectrum of the town object A is similar to that of the flat_field object B, and the spectrum of the city_grass object D is similar to that of the flat_field object C. Therefore, it is necessary to fuse the spatial relationships into the adjacency matrix. However, the spatial relationship derived from an individual sample often leads to the following problem.
Remote Sens. 2021, 13, x FOR PEER REVIEW 2 of 31
(2) Confusion of spatial distance: "violating the first law of geography". According to Tobler's first law of geography, everything is related to everything else, but near things are more related than distant things [13]. However, because samples are cut from a larger image, the neighbors of some nodes in a sample may be incomplete or inaccurate. If the spatial relationship is considered only within the individual sample, the spatial relationships between a node and its neighbors will be distorted, which may lead to the misclassification of the node. Figure 2 shows two objects that are cut at the corners of samples.
Figure 2. Two objects that are cut at the corners of samples. (a,c) are two samples; (b,d) are the corresponding surrounding environments of (a,c). The forest object A and the city_forest object B are the two objects.
According to the aggregation principle of GNNs, the neighbor nodes exert their influence on the central node. When the spatial relationships are considered, the weight of an individual object becomes smaller as the distance increases. Figure 3 shows a central node and its neighbors.
As in the schematic diagram, $O_i$ is the central node, and $O_j^1$, $O_k^2$, and $O_l^3$ represent the neighbor nodes with spatial distances of 1, 2, and 3, respectively. Then, the aggregation formula of the central node $O_i$ can be expressed as follows:
$$O_i = \sum_{j \in N_1} \alpha_{ij} O_j^1 + \sum_{k \in N_2} \alpha_{ik} O_k^2 + \sum_{l \in N_3} \alpha_{il} O_l^3$$
where $N_1$, $N_2$, and $N_3$ represent the sets of neighbor nodes with spatial distances of 1, 2, and 3, respectively, and $\alpha$ is the aggregation weight.
The farther the distance from the central node is, the more equidistant neighbor nodes there are, i.e., $|N_1| < |N_2| < |N_3|$. If there are geographic objects of the same category far away, their contributions may accumulate and have a greater impact on the central node than the nearby objects. This may lead to the problem of "violating the first law of geography" and cause the misclassification of the central node.
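The accumulation effect described above can be made concrete with a small numeric sketch. It is only an illustration under stated assumptions: the ring sizes (2, 5, 9) and the per-node weight proportional to 1/d are hypothetical choices, not values from the paper, and `ring_shares` is a helper introduced here.

```python
def ring_shares(ring_sizes):
    """Share of the aggregated signal contributed by each distance ring.

    Assumption (hypothetical): a neighbor at distance d gets an unnormalized
    weight 1/d, and the weights are normalized over all neighbors.
    """
    raw = [size / d for d, size in enumerate(ring_sizes, start=1)]
    total = sum(raw)
    return [r / total for r in raw]


# Assumed neighbor counts with |N1| < |N2| < |N3|.
ring_sizes = [2, 5, 9]
shares = ring_shares(ring_sizes)

for d, (size, share) in enumerate(zip(ring_sizes, shares), start=1):
    print(f"distance {d}: {size} neighbors, total share {share:.3f}")
```

Although each distant neighbor is weighted less than each near one, the most distant ring contributes the largest total share in this toy setting because it contains more nodes. This is the kind of imbalance that a distance-aware gate is meant to control.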
Considering these, it is hard for traditional GNNs to solve the above two problems effectively because they rely only on the context information of the individual sample; solving them may require geographic prior knowledge based on the whole research area. Therefore, we propose the KSPGAT network, a remote sensing semantic segmentation model based on prior knowledge, spatial pyramid distance, and a GAT with a gating mechanism. The KSPGAT network takes geographic objects as the unit of segmentation. Its computational complexity is relatively small, and it is free from salt-and-pepper effects. Additionally, it can carry richer and higher-level semantics, such as spatial relations and category co-occurrence prior knowledge, so it can better recognize the objects disturbed by the above two problems.
In summary, the main contributions of this paper are as follows:
(1) A novel spatial correlation recognition algorithm based on the spatial pyramid distance is proposed.
(2) A gating mechanism based on prior knowledge is proposed to control the aggregation of neighbor nodes in the graph neural network.
(3) A graph neural network model for remote sensing semantic segmentation is constructed, which effectively integrates the similarity of geographic objects, spatial relationships and global geographic prior knowledge.
The remainder of this paper is organized as follows: Section 2 discusses related work. A remote sensing semantic segmentation model based on prior knowledge, spatial pyramid distance, and a GAT with a gating mechanism is presented in Section 3. The experiments are provided in Section 4, the analysis is presented in Section 5, and the conclusion is provided in Section 6.

Related Work
Unlike pixels, which are usually the smallest unit of RS image analysis, image-objects are defined by Hay et al. [14] as basic entities located within an image that are perceptually generated from H-res pixel groups. As geographic objects can provide richer spatial relationship information than pixels, object-based image analysis (OBIA) methods are making considerable progress.
OBIA methods usually use a segmentation algorithm to obtain the objects first and then use the objects for subsequent image analysis. Currently, OBIA methods are widely applied in multi-scale research [15,16], change detection [17], and landslide detection [18]. To better understand ecological patterns, they have also been extended to the species-level mapping of vegetation [19]. Other research, such as References [20,21], presented comparative evaluations of pixel-based and object-based methods; in particular, Reference [21] compared pixel-based support vector machine (SVM) classification with decision-tree-oriented geographic object-based image analysis (GEOBIA) classification and found that the GEOBIA classification provided the highest accuracy. Besides, Reference [22] discussed in detail the idea and method of geographic ontology modeling based on object-oriented remote sensing technology.
On this basis, GEOBIA methods based on neural networks have attracted increasing attention. For example, Reference [23] proposed a novel land use classification method for high-resolution remote sensing images based on a parallel spectral-spatial convolutional neural network (CNN) and object-oriented remote sensing technology. Also building on geographic object-based image analysis, Reference [24] presented an artificial neural network (ANN) integrated with particle swarm optimization (PSO) to enhance the learning process.
Building on this research, our work attempts to combine GEOBIA and neural networks and explores further in this direction.

Remote Sensing with GNN
As an important branch of the deep learning family, strategies based on graph neural networks [25,26] have grown more and more popular and achieve state-of-the-art performance in both graph feature extraction and classification. Among them, the graph convolutional network (GCN) [27] plays an important role. Furthermore, Reference [28] proposes graph attention networks (GATs), which can attend over their neighborhoods' features and assign different weights to different nodes in a neighborhood without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. Inspired by these, applying deep neural networks to graph-structured data has recently been of interest to the vision community. For example, approaches such as References [29,30] tried to generalize convolution layers to graphs. Other works, like References [31,32], attempted to learn knowledge graphs and use graphs for visual reasoning.
However, directly stacking more layers in current GNNs brings the problem of over-smoothing, which drives the output of a GCN towards a space that contains limited distinguishing information among nodes, leading to poor expressivity. To solve this problem, many researchers have recently conducted beneficial explorations. Some of them use a random-walk method [33] or restrict the neighborhood expansion size [34,35]. Other studies alleviate the issue by deleting edges in the graph [36] or by incorporating multi-hop neighboring context into the attention computation [37].
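The over-smoothing effect can be illustrated with a toy propagation loop. The sketch below is an assumption-laden simplification: it uses plain mean aggregation with self-loops on a hypothetical four-node path graph, with no learned weights or nonlinearity, so it only shows the qualitative trend rather than the behavior of any cited model.

```python
def mean_aggregate(features, neighbors):
    """One propagation step: each node takes the mean of itself and its neighbors."""
    return [
        (features[i] + sum(features[j] for j in nbrs)) / (1 + len(nbrs))
        for i, nbrs in enumerate(neighbors)
    ]


# Hypothetical path graph 0-1-2-3 with scalar node features.
neighbors = [[1], [0, 2], [1, 3], [2]]
h = [0.0, 1.0, 2.0, 3.0]

spreads = []  # max - min of the node features after each layer
for layer in range(10):
    h = mean_aggregate(h, neighbors)
    spreads.append(max(h) - min(h))

print([round(s, 3) for s in spreads])
```

The spread between node features shrinks with every additional layer, so after enough layers the nodes become nearly indistinguishable. This is one common motivation for keeping graph networks shallow.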
Apart from these, combining a prior knowledge base with GNN models for vision tasks has also become popular. Reference [38] used the knowledge graph to perform zero-shot classification. References [39][40][41] used common-sense or structured prior knowledge to improve the performance of deep models. Other works, like References [42][43][44], tried to use graph embedding to learn prior knowledge or relationships between labels.
In remote sensing image analysis, because hyperspectral images usually have large homogeneous regions, some research has started to use graph neural networks (GNNs) [45][46][47][48]. For example, Reference [45] investigated the use of graph convolutional networks (GCNs) to characterize spatial arrangement features for land use classification, and Reference [46] proposed a novel deep learning-based MLRSSC framework that combines a graph neural network (GNN) and a convolutional neural network (CNN) to mine the spatio-topological relationships of the scene graph. To further improve detection accuracy, Reference [49] proposed a novel anomaly detection method based on texture feature extraction and a graph dictionary-based low-rank decomposition (LRD).
Currently, some other scholars have also started to integrate geographic prior knowledge into remote sensing analysis. Reference [50] presented a spatial location constraint for hyperspectral image classification, which is exploited to incorporate prior knowledge of the location information of training pixels. Similarly, to further improve recognition performance, Reference [51] proposed a simplified graph-based visual saliency model for airport detection in panchromatic remote sensing images, which introduced the concept of near parallelity for the first time and treated it as prior knowledge.
In summary, in remote sensing analysis research, pixel-based methods are computationally expensive and cannot capture object-based spatial relationships or geographic prior knowledge. Most current GNN methods can carry richer semantics with relatively small computational complexity, but they rarely consider geographic prior knowledge. Therefore, how to better introduce prior knowledge into remote sensing analysis still needs further exploration.

Methodology
In this section, the overall structure of the KSPGAT network is introduced first in Section 3.1. To solve the problem of "different objects with the same spectrum", a novel spatial pyramid distance algorithm is presented in Section 3.3. Further research shows that relying only on the spatial relationships between geographic objects cannot effectively solve the problem of "violating the first law of geography". For this reason, a gating mechanism that embeds geographic prior knowledge into the GAT model is then introduced in Sections 3.4 and 3.5, and it can alleviate both problems to some extent. Finally, considering the problem of over-smoothing in graph neural networks, we limit the network depth to two layers and incorporate the effect of co-occurrence knowledge into the loss function in Section 3.6.

Network Structure
The KSPGAT network is designed as an encoder-decoder structure and consists of four modules: the superpixel clustering module, the feature extraction module, and the spatial correlation recognition algorithm together constitute the encoder, and the knowledge-based gating mechanism is the decoder. As shown in Figure 4, a superpixel clustering module and a feature extraction module first obtain the object features from the input remote sensing image. Then, a spatial pyramid distance recognition algorithm recognizes the spatial relationships between objects. Finally, on the basis of fusing the feature similarity and the spatial relationships, a multi-source attention mechanism and a gating mechanism aggregate neighbor nodes more accurately. Specifically, the network contains the following modules.

Superpixel Clustering Module
The region merging segmentation method proposed in References [52][53][54] is used to perform superpixel clustering of the remote sensing images, so that each sample yields several superpixel blocks. These superpixel blocks then provide the masks for the feature extraction module and the spatial correlation recognition algorithm.

Feature Extraction Module
The feature extraction module of the network takes the remote sensing image and the object mask as input to obtain object features. First, the remote sensing image is passed through a convolutional neural network (CNN) to obtain the global feature. Then, the mask is used to obtain the feature of each object. Finally, the nodes of the graph neural network are generated from the object features.

Spatial Correlation Recognition Algorithm Based on Spatial Pyramid Distance and Multi-Source Attention Mechanism
In this module, we propose a novel spatial correlation recognition algorithm. First, we design a location coding method based on pyramid pooling to obtain the location coding vector of each object. We then use this vector to identify the spatial distance between objects, which is a simple and efficient way to generate the spatial correlation between nodes. Finally, the similarity of features and the spatial correlation between nodes are combined to design and implement the multi-source attention mechanism.

Gating Mechanism Based on Prior Knowledge of Category Co-Occurrence
The gating mechanism uses category co-occurrence knowledge to train control gates corresponding to different spatial pyramid distances (spatial relations). It realizes an aggregation method that integrates knowledge, spatial relations, and node similarity, thereby improving the classification accuracy of the central node.
The following subsections will expand on each module of the network.

Superpixel Clustering Module and Feature Extraction Module
First, the region merging segmentation method proposed in References [52][53][54] is used to perform superpixel clustering of the remote sensing images, so that each sample yields several superpixel blocks. Each superpixel block is a geographical object with geographic semantics and can be used as an $\frac{H}{4} \times \frac{W}{4}$ mask. Then, the feature extraction module is adopted to extract object features with this mask. Figure 5 shows the structure of the feature extraction module:

As shown in Figure 5, the feature extraction module first takes the remote sensing image ($W \times H \times 3$) and the object masks ($N \times W/4 \times H/4$) as input, where $N$ is the number of objects. Then, the remote sensing image is passed through a CNN to obtain the global feature, which is multiplied by the masks to obtain the masked object features ($N \times W/4 \times H/4 \times C$), where $C$ is the hidden dimension. The masked object features are sent to two branches and then concatenated to obtain the final feature Feat, which has dimension $N \times (C + 49)$ and is used as the initial node feature $h$ of the graph network.
The calculation formula of this module is as follows:
$$\mathrm{Feat} = G\big(F(x) \odot \mathrm{Mask}\big)$$
where $x$ is the input image; $F(\cdot)$ is the global feature extraction function, which consists of two convolutional layers and pooling layers; Mask is the object mask; and $G(\cdot)$ is the feature extraction function, which consists of two branches, one a global average pooling (GAP) layer and the other a max pooling layer with a convolutional layer.
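The two-branch assembly of one object's feature can be sketched in plain Python. This is a minimal sketch under several assumptions that the text does not specify: the global feature map is supplied directly (the CNN $F$ is not modeled), the channel mean stands in for the learned convolution of the max-pooling branch, and that branch pools onto a 7 × 7 grid so that the output length is C + 49. All function names are hypothetical.

```python
def masked_gap(feature, mask):
    """GAP branch: average each channel of the masked feature map.

    feature: C x H x W nested lists; mask: H x W nested lists of 0/1.
    """
    h, w = len(mask), len(mask[0])
    return [
        sum(channel[i][j] * mask[i][j] for i in range(h) for j in range(w)) / (h * w)
        for channel in feature
    ]


def masked_max_grid(feature, mask, grid=7):
    """Max-pooling branch: pool a single-channel masked map onto a grid x grid layout.

    Assumption: the channel mean replaces the learned convolution, and
    H and W are multiples of `grid`.
    """
    channels = len(feature)
    h, w = len(mask), len(mask[0])
    bh, bw = h // grid, w // grid
    # Collapse the channels, then apply the object mask.
    single = [
        [sum(feature[c][i][j] for c in range(channels)) / channels * mask[i][j]
         for j in range(w)]
        for i in range(h)
    ]
    return [
        max(single[i][j]
            for i in range(gi * bh, (gi + 1) * bh)
            for j in range(gj * bw, (gj + 1) * bw))
        for gi in range(grid)
        for gj in range(grid)
    ]


def object_feature(feature, mask):
    """Concatenate both branches into one vector of length C + 49."""
    return masked_gap(feature, mask) + masked_max_grid(feature, mask)


# Toy input: C = 3 constant channels on a 14 x 14 map,
# with the object occupying the top-left 7 x 7 corner.
C, H, W = 3, 14, 14
feature = [[[float(c + 1)] * W for _ in range(H)] for c in range(C)]
mask = [[1 if i < 7 and j < 7 else 0 for j in range(W)] for i in range(H)]

feat = object_feature(feature, mask)
print(len(feat))  # C + 49 entries for this object
```

Stacking such vectors for all N objects gives an N × (C + 49) matrix, which matches the dimension stated for the initial node features.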

The Spatial Correlation between Objects: The Spatial Pyramid Distance
In this section, we introduce a novel location encoding method based on pyramid pooling, and on this basis, we propose the spatial pyramid distance to represent the spatial correlation between objects. The details are as follows.

The Location Encoding Method Based on Pyramid Pooling
Pyramid pooling is used to encode the two-dimensional position information of the object mask. The object mask is passed through 3-level pyramid average pooling (28 × 28, 14 × 14, 7 × 7) to obtain multiscale spatial location features. We use the area ratio of the mask in the pooling kernel to calculate the pooled value, which includes the semantic of object's area. The formula of average pooling is as follows: As shown in Figure 5, the feature extraction module first takes the remote sensing image (W × H × 3) and the object masks (N × W/4 × H/4) as input, where N is the number of objects. Then, the remote sensing image is passed through a CNN to obtain the global feature, which is multiplied by the mask to obtain the masked object features (N × W/4 × H/4 × C), where C is the hidden dimension. The masked object features are sent to two branches, respectively, and then concatenated to obtain the final feature Feat, which has dimension N × (C + 49), and is used as the initial node feature h of the graph network.
The calculation formula of this module is as follows: where x is the input image, F(.) is the global feature extraction function, which consists of two convolutional layers and pooling layers, Mask is the object mask, and G(.) is the feature extraction function, which consists of two branches, one a global average pooling (GAP) layer, and the other a max pooling layer with convolutional layer.

The Spatial Correlation between Objects: The Spatial Pyramid Distance
In this section, we introduce a novel location encoding method based on pyramid pooling, and on this basis, we propose the spatial pyramid distance to represent the spatial correlation between objects. The details are as follows.

The Location Encoding Method Based on Pyramid Pooling
Pyramid pooling is used to encode the two-dimensional position information of the object mask. The object mask is passed through 3-level pyramid average pooling (28 × 28, 14 × 14, 7 × 7) to obtain multiscale spatial location features. We use the area ratio of the mask in the pooling kernel as the pooled value, so the pooled value also carries the semantics of the object's area. The formula of average pooling is as follows:

$$c_{pq} = \frac{1}{k^{2}} \sum_{i=p}^{p+k-1} \sum_{j=q}^{q+k-1} \mathrm{Mask}_{ij}$$

where k is the size of the pooling kernel, Mask is the object mask, p is the start index in the row direction, q is the start index in the column direction, and c_pq is the value after pooling. Then, the location features are encoded into a one-dimensional vector by quadtree encoding. The vector e can be defined as follows:

$$e = \mathrm{concat}\left(\{c^{k}\}\right), \quad k \in \{28, 14, 7\}$$

where c^k denotes the grid of pooled values c_pq obtained with pooling kernel size k, {·} represents the quadtree encoding, and concat(·) means vector concatenation. In this way, the approximate location of each object can be obtained, and the value of each grid cell represents the semantics of the object's area. Figure 6 shows the position vectors of two object masks after spatial pyramid pooling. After pooling at three scales, there is a significant difference between the encoding vectors of two objects that are far apart, which is conducive to the subsequent spatial distance recognition.
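The location encoding above can be illustrated with a minimal NumPy sketch. It assumes a 56 × 56 binary mask (W/4 × H/4 for a 224 × 224 sample) and non-overlapping pooling windows; the function names are hypothetical, and row-major flattening is used here only as a stand-in for the quadtree ordering, whose exact traversal is not specified in this section.

```python
import numpy as np

def avg_pool_area_ratio(mask: np.ndarray, k: int) -> np.ndarray:
    """Non-overlapping k x k average pooling of a binary mask.

    Each output cell is the fraction of that cell covered by the object,
    i.e. the area ratio of the mask inside the pooling kernel.
    """
    h, w = mask.shape
    assert h % k == 0 and w % k == 0, "mask size must be divisible by k"
    blocks = mask.reshape(h // k, k, w // k, k)
    return blocks.mean(axis=(1, 3))

def location_encoding(mask: np.ndarray, kernels=(28, 14, 7)) -> np.ndarray:
    """Concatenate the flattened pooled grids of all pyramid levels.

    Assumption: row-major flattening replaces the quadtree ordering.
    """
    return np.concatenate([avg_pool_area_ratio(mask, k).ravel() for k in kernels])

# Example: an object covering the left half of the top-left 28 x 28 quadrant.
mask = np.zeros((56, 56), dtype=np.float64)
mask[0:28, 0:14] = 1.0
e = location_encoding(mask)  # 2*2 + 4*4 + 8*8 = 84 values
```

With these kernel sizes a 56 × 56 mask yields grids of 2 × 2, 4 × 4 and 8 × 8 cells, so the coarse level localizes the object roughly while the fine level records how much of each small cell it occupies.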

Spatial Pyramid Distance
On the basis of the above location encoding method, we define the spatial pyramid distance as a discrete representation of the spatial relationships between nodes. It is based on the number of other objects lying between nodes i and j: interval_number = 0 means that the two nodes are directly adjacent, interval_number = 1 means that there is only one object between them, and interval_number ≥ 2 means that there are multiple objects between them. Based on this, the spatial pyramid distance sp_map_ij between nodes i and j is defined in Table 1.

Table 1. The value of spatial pyramid distance.

| interval_number | sp_map_ij | Spatial relationship |
| --- | --- | --- |
| 0 | 1 | spatial adjacency |
| 1 | 2 | spatial separation (near) |
| ≥ 2 | 3 | spatial separation (far) |

Therefore, the three different values of sp_map_ij represent spatial adjacency, spatial separation (near), and spatial separation (far), respectively.
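The discretization described above can be written as a one-line mapping. The sketch below is only an illustration; the function name is hypothetical, and it assumes that interval_number has already been counted for the pair of objects.

```python
def spatial_pyramid_distance(interval_number: int) -> int:
    """Map the number of objects between two nodes to sp_map.

    0 objects   -> 1 (spatial adjacency)
    1 object    -> 2 (spatial separation, near)
    2 or more   -> 3 (spatial separation, far)
    """
    if interval_number < 0:
        raise ValueError("interval_number must be non-negative")
    return min(interval_number, 2) + 1
```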

The Spatial Correlation Recognition Algorithm Based on Spatial Pyramid Distance
Based on the location encoding method and spatial pyramid distance, we design a spatial correlation recognition algorithm. The algorithm is used to describe the spatial distance between geographical objects discretely according to three different values, where 1 represents spatial adjacency, 2 represents spatial separation (near) and 3 represents spatial separation (far).
The algorithm is given as Algorithm 1.
Algorithm 1. Recognizing the spatial pyramid distance
Input: mask of object i, mask_i; mask of object j, mask_j
Output: distance feature vector sp_vec_ij; spatial pyramid distance sp_map_ij
1 Begin
2 For t ← 1 to 3 step 1 do
3 k ← 56 / 2^t; // calculate the pooling size k
4 e_i^k ← avgpooling_k(mask_i); // encode the position vector of mask_i with pooling size k
5 e_j^k ← avgpooling_k(mask_j); // encode the position vector of mask_j with pooling size k
6 End For
7 concatenate all e_i^k to obtain the multiscale location features e_i of mask_i;
8 concatenate all e_j^k to obtain the multiscale location features e_j of mask_j;
9 v_ij ← e_i − e_j; // subtract the position encoding vectors of mask_i and mask_j
10 sp_vec_ij ← MLP(v_ij); // embed the difference into the distance feature vector
11 sp_map_ij ← MLP(sp_vec_ij); // predict the spatial pyramid distance
12 Return sp_vec_ij, sp_map_ij;
13 End

We then design a network to implement the algorithm; its structure is shown in Figure 7. The network takes the masks of each pair of objects as input, which can be expressed as (N × N) × W/4 × H/4 × 2. Then, the location encoding method based on pyramid pooling is used to obtain the spatial location vectors at different scales. To fuse these features, each of them is encoded into a one-dimensional vector, and the vectors are concatenated to obtain the multiscale location features (Position Features). The Position Features of the two objects are then subtracted to obtain the distance features, which pass through a multi-layer perceptron (MLP) layer to obtain the distance feature vector Sp_Vec between each pair of objects, and finally through another MLP layer to obtain the spatial pyramid distance Sp_Map. On our dataset, the overall accuracy of the algorithm reaches 83.6%, which is sufficient to describe the spatial distance between objects approximately. Therefore, it can provide a favorable basis for subsequent node classification.
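The data flow of this recognition network can be sketched as follows. This is not the trained network: the weights are random, the layer widths (32 hidden units, a 16-dimensional Sp_Vec) are hypothetical, and taking the argmax over three output logits to obtain Sp_Map is an assumption about how the final MLP is read out. The sketch only shows how pairwise differences of the location encodings lead to Sp_Vec and Sp_Map for all N × N pairs.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with a ReLU hidden layer."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def recognise_pairs(enc, vec_params, map_params):
    """enc: (N, D) location encodings, one row per object.

    Returns sp_vec of shape (N, N, d_vec) and sp_map of shape (N, N)
    with values in {1, 2, 3}.
    """
    v = enc[:, None, :] - enc[None, :, :]   # v_ij = e_i - e_j, shape (N, N, D)
    sp_vec = mlp(v, *vec_params)            # distance feature vector
    logits = mlp(sp_vec, *map_params)       # three scores per pair (assumed)
    sp_map = logits.argmax(axis=-1) + 1     # classes 0..2 -> distances 1..3
    return sp_vec, sp_map

# Hypothetical sizes: D = 84 encoding values, 32 hidden units, d_vec = 16.
rng = np.random.default_rng(0)
D, H, d_vec, N = 84, 32, 16, 4
vec_params = (rng.normal(size=(D, H)), np.zeros(H),
              rng.normal(size=(H, d_vec)), np.zeros(d_vec))
map_params = (rng.normal(size=(d_vec, H)), np.zeros(H),
              rng.normal(size=(H, 3)), np.zeros(3))
enc = rng.random((N, D))
sp_vec, sp_map = recognise_pairs(enc, vec_params, map_params)
```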
Compared with the spatial relationship analysis methods in GIS or in other neural networks, our method has the following advantages: (a) As it uses the area ratio of the mask in the pooling kernel to calculate the pooled value, it carries richer semantics of the object's area; (b) Unlike methods based on the centroid distance between objects, it is easy to implement; (c) Compared with other neural networks based on object detection, it does not require complex computations, which reduces the load on the network and therefore improves the generalization ability.
In summary, this algorithm can efficiently construct the adjacency matrix between nodes and quickly provide reliable input for the graph neural network. Based on the algorithm, the KSPGAT network then combines the spatial relationship and the similarity of features to design and implement the following multi-source attention mechanism.

Multi-source Attention Mechanism Based on Similarity of Spectral Features and Spatial Relationships of Geographic Objects
In this section, we first analyze the problem of the attention mechanism in the baseline GAT, and then we propose our multi-source attention mechanism.

Attention Mechanism in the Baseline GAT
The attention mechanism in the baseline GAT only considers the similarity of the spectral features between nodes. Let h denote the original node feature. First, the node feature is multiplied by a weight W_h to project it into a new space. Then, to consider the feature similarity between nodes i and j, the transformed features of the two nodes are concatenated, and the feature similarity is calculated through a weight a. Finally, the correlation between the two nodes, defined as α_ij, is obtained through a softmax function. In this calculation, W_h is a learnable weight that projects the spectral features of nodes, || represents vector concatenation, softmax is the row-wise normalization function, and s is the scaling factor that prevents the weights of neighbours from being too small. As this mechanism relies mainly on spectral similarity to aggregate neighbor nodes, it cannot solve the problem of "different objects with the same spectrum". Therefore, it is necessary to integrate spatial relationships into the attention mechanism. The improved mechanism is described below.

Multi-Source Attention Mechanism Based on Geographic Object Feature Similarity and Pyramid Distance
In the multi-source attention mechanism, to solve the problem of "different objects with the same spectrum", we consider not only the similarity of the spectral features between the geographic objects, but also the spatial pyramid distance between them. We multiply the distance feature Sp_Vec_ij of nodes i and j by a weight W_s to project the distance feature into the same space as the spectral features, and then concatenate it with the transformed spectral features of nodes i and j. In the resulting attention calculation, W_h is a learnable weight that projects the spectral features of nodes, W_s is a learnable weight that projects the distance features, || represents vector concatenation, softmax is the row-wise normalization function, and s is the scaling factor that prevents the weights of neighbours from being too small. However, due to the problem of "violating the first law of geography" discussed in Section 1, the aggregation of neighbor nodes is still severely limited even when the spatial relationship is considered. For this reason, the following gating mechanism based on category co-occurrence knowledge is introduced to control the aggregation of neighbor nodes more accurately.
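The verbal description of the multi-source attention can be turned into a small NumPy sketch. Several details are assumptions rather than the formula of this paper: the score is taken as a · [W_h h_i || W_h h_j || W_s Sp_Vec_ij], a LeakyReLU is applied as in the standard GAT, and the scaling factor s multiplies the scores before the row-wise softmax. All function and argument names are hypothetical.

```python
import numpy as np

def multi_source_attention(h, sp_vec, W_h, W_s, a, s=1.0):
    """Row-normalized attention from spectral and distance features.

    h: (N, F) node features, sp_vec: (N, N, D) distance features,
    W_h: (F, C), W_s: (D, C), a: (3 * C,), s: scaling factor.
    """
    z = h @ W_h                                    # projected spectral features
    d = sp_vec @ W_s                               # projected distance features
    zi = np.broadcast_to(z[:, None, :], d.shape)   # feature of central node i
    zj = np.broadcast_to(z[None, :, :], d.shape)   # feature of neighbor j
    e = np.concatenate([zi, zj, d], axis=-1) @ a   # raw scores, shape (N, N)
    e = np.where(e > 0, e, 0.2 * e)                # LeakyReLU (assumed)
    e = s * e                                      # placement of s is assumed
    e = e - e.max(axis=1, keepdims=True)           # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)        # softmax by row

rng = np.random.default_rng(0)
N, F, D, C = 5, 8, 16, 4
alpha = multi_source_attention(
    rng.normal(size=(N, F)), rng.normal(size=(N, N, D)),
    rng.normal(size=(F, C)), rng.normal(size=(D, C)),
    rng.normal(size=3 * C), s=2.0,
)
```

Dropping the `d` term and shortening `a` to length 2C recovers a spectral-only score of the kind used by the baseline GAT, which makes the contribution of the distance feature easy to isolate.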

Knowledge-Based Gating Mechanism
To overcome the problem of "violating the first law of geography" caused by the distortion of spatial relationships at the edges of samples in some areas, we first summarize the co-occurrence probability between different categories from the whole dataset, and then design a gated graph attention network that uses the co-occurrence probability to expand the receptive field of an object from the specific sample to the whole research area. This corrects some distortions of the spatial relationships, thereby improving the accuracy of remote sensing semantic segmentation. Furthermore, through the gating mechanism, the neighbor nodes are filtered based on the prior knowledge, so the mechanism can mitigate over-smoothing to a certain extent.

Category Co-Occurrence Knowledge in the Sample Set
Category co-occurrence means the probability of two categories appearing in a scene at the same time. As shown in Figure 8, M is the category co-occurrence probability matrix, which represents the co-occurrence between each pair of categories. The size of M is C × C, where C is the number of categories. M_ij represents the proportion of samples S_ij containing both categories i and j among all samples S_i containing category i, i.e., M_ij = S_ij / S_i.
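The definition M_ij = S_ij / S_i can be checked on a toy sample set. In the sketch below each sample is represented only by the set of category ids it contains; the function name and the toy data are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(samples, num_classes):
    """Category co-occurrence matrix M with M[i, j] = S_ij / S_i.

    samples: iterable of collections of category ids, one per sample.
    S_i counts the samples containing category i, and S_ij counts the
    samples containing both i and j. Rows of absent categories stay zero.
    """
    counts = np.zeros((num_classes, num_classes), dtype=np.float64)
    for cats in samples:
        present = sorted(set(cats))
        for i in present:
            for j in present:
                counts[i, j] += 1.0
    s_i = np.diag(counts).copy()[:, None]          # S_i as a column
    return np.divide(counts, s_i, out=np.zeros_like(counts), where=s_i > 0)

# Toy data: four samples over three categories.
samples = [{0, 1}, {0}, {0, 1, 2}, {2}]
M = cooccurrence_matrix(samples, num_classes=3)
```

Note that M is generally not symmetric: in the toy data every sample containing category 1 also contains category 0, so M[1, 0] = 1, whereas M[0, 1] = 2/3.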
As discussed in Section 1, because of the way samples are cut in some areas, the neighbors of some nodes in a sample may be incomplete or inaccurate, which may distort the spatial relationship between the nodes and their neighbors. The co-occurrence relationship is a more universal relationship, obtained from the co-occurrence probability in the dataset, which describes the general geographic prior knowledge of the research area. Therefore, the weight of the nodes can be strengthened or weakened according to the co-occurrence probability. This can correct some distortions of the spatial relationships and alleviate the problem of "violating the first law of geography", thus making the aggregation of neighbor nodes more accurate.
Based on the above-mentioned category co-occurrence knowledge, we then design the following gated graph attention network and use the category co-occurrence GT to train the control gate.

Gated Graph Attention Network Based on Category Co-Occurrence Prior Knowledge
In this section, we introduce our novel gated graph attention network. In the network, we divide all neighbor nodes into K groups according to their spatial correlation with the central node. For each group, we design a gating mechanism. During the training phase, the co-occurrence knowledge of two categories is used to control the aggregation of neighbor nodes, thereby correcting the problem of "violating the first law of geography". Furthermore, through the gating mechanism, the neighbor nodes are filtered based on the prior knowledge, so the mechanism can mitigate over-smoothing to a certain extent.
(1) Structure Our gated graph attention network incorporates the multi-source attention mechanism and adds a gating mechanism that integrates category co-occurrence knowledge to control the aggregation of neighbor nodes more accurately. Figure 9 shows the structure of one head of the gated graph attention network.
As the major part of the KSPGAT network, for a central node i, the gated graph attention network takes its node feature h_i, the node feature h_j of its neighbor node j (j ≠ i), their distance vector Sp_Vec_ij and their spatial pyramid distance Sp_Map_ij as input. Then, the multi-source attention mechanism and the gating mechanism are combined to aggregate the features of neighbor nodes. During the training phase, the prior knowledge of category co-occurrence is integrated into the gating mechanism by using the category co-occurrence ground truth (GT) to supervise and train the control gates.
To better represent the co-occurrence relationship between nodes, we define the co-occurrence knowledge according to the probabilities P_ij and P_ji in the category co-occurrence matrix: if max(P_ij, P_ji) ≥ 0.5, category i and category j are considered to have a co-occurrence relationship. In this case, the GT of the co-occurrence control gate is set to 1, which indicates that the gate is opened; otherwise, there is no co-occurrence relationship, and the GT is set to 0 to indicate that the gate is closed.
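The thresholding rule for the gate ground truth can be sketched as follows. The function name is hypothetical, and the sketch assumes that each node carries a category label during training so that the class-level decision can be looked up for every pair of nodes.

```python
import numpy as np

def gate_ground_truth(P, labels, threshold=0.5):
    """Gate GT for every pair of nodes from the co-occurrence matrix.

    P: (C, C) category co-occurrence probabilities.
    labels: (N,) category id of each node.
    Returns an (N, N) array with 1.0 where max(P_ij, P_ji) >= threshold
    for the categories of the two nodes (gate opened), else 0.0.
    """
    P = np.asarray(P, dtype=np.float64)
    labels = np.asarray(labels)
    symmetric = np.maximum(P, P.T)                       # max(P_ij, P_ji)
    class_gate = (symmetric >= threshold).astype(np.float64)
    return class_gate[labels[:, None], labels[None, :]]

P = [[1.0, 0.6, 0.1],
     [0.3, 1.0, 0.2],
     [0.4, 0.2, 1.0]]
gate_gt = gate_ground_truth(P, labels=[0, 1, 2, 1])
```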
(2) Aggregation with multi-group Based on the K (K = 3) spatial distances, which are adjacent, separated (near), and separated (far), respectively, we divide all neighbor nodes into K groups. Accordingly, K kinds of GAT heads are designed to aggregate neighbor nodes with different spatial relationships. For the GAT head of each spatial relationship, a gate that incorporates the category co-occurrence knowledge is designed to control the aggregation of neighbor nodes. Figure 10 shows the aggregation of the network. As shown in Figure 10, the network divides all nodes into K groups according to their spatial correlation with the central node to form a new GAT head with a gating mechanism, called gate_head.
Figure 11 shows the flow chart of aggregation. As shown in Figure 11, during node aggregation, the node features h, the distance vector Sp_Vec and the spatial pyramid distance Sp_Map are input into the gated graph attention network to generate the new node features h′. N is the number of nodes, and K is 3, which represents the three kinds of spatial distances. N_k represents the number of neighbor nodes with a spatial distance of k.
For each central node i and one of its neighbor nodes j (j ≠ i) whose spatial relationship with i is k, the aggregation of node i proceeds as follows:
(a) First, head_ij^k, which aggregates the information of node j to the central node i, is calculated from h_i, h_j, sp_vec_ij and sp_map_ij.
(b) Then, gate_ij^k is calculated from h_i, h_j, sp_vec_ij and sp_map_ij. If nodes i and j have a co-occurrence relationship, gate_ij^k will be opened; otherwise, it will be closed.
(c) Finally, the new node feature h′_i of the central node i is calculated from head_ij^k and gate_ij^k of all neighbor nodes.
In summary, the control gate is calculated from the node feature, the embedded distance feature and the spatial pyramid distance. In this calculation, W_G_h^k and W_G_s^k are learnable weights in the gate that controls the k-th GAT head, N_i represents all neighbor nodes of node i, W_H_h^k is a learnable weight in the k-th GAT head, and α_ij^k is the attention weight calculated by the multi-source attention mechanism in the k-th GAT head. K = 3 represents the three spatial pyramid distances, and k = 1, 2, 3 indicates a spatial pyramid distance of 1, 2, and 3, respectively.
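Steps (a) to (c) can be illustrated with a simplified sketch. It assumes, as one plausible reading of the description, that the contribution of neighbor j to node i is the product of the gate value, the attention weight and the projected neighbor feature, summed over the neighbors in each distance group; the attention weights and gate values are passed in as precomputed arrays, and the output activation is omitted. The names and shapes are hypothetical.

```python
import numpy as np

def gated_group_aggregate(h, alpha, gate, sp_map, W_H):
    """Aggregate neighbors group by group under gate control.

    h: (N, F) node features; alpha, gate: (K, N, N) attention weights and
    gate values of the K heads; sp_map: (N, N) distances in 1..K;
    W_H: (K, F, C) projection weight of each head. Returns (N, C).
    """
    n_nodes = h.shape[0]
    n_groups = W_H.shape[0]
    out = np.zeros((n_nodes, W_H.shape[2]))
    not_self = ~np.eye(n_nodes, dtype=bool)        # neighbors satisfy j != i
    for k in range(1, n_groups + 1):
        in_group = (sp_map == k) & not_self        # neighbors at distance k
        weights = alpha[k - 1] * gate[k - 1] * in_group
        out += weights @ (h @ W_H[k - 1])          # gated, weighted sum
    return out

h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sp_map = np.array([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
alpha = np.ones((3, 3, 3))
W_H = np.stack([np.eye(2)] * 3)

gate_open = np.ones((3, 3, 3))
out_open = gated_group_aggregate(h, alpha, gate_open, sp_map, W_H)

gate_far_closed = gate_open.copy()
gate_far_closed[2] = 0.0                           # close the gate of the "far" group
out_far_closed = gated_group_aggregate(h, alpha, gate_far_closed, sp_map, W_H)
```

Closing the gate of one group removes exactly the neighbors at that distance from the sum, which is the filtering effect that the gating mechanism relies on.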
(3) Discussion Different from the traditional attention aggregation mechanism, the KSPGAT network adopts a multi-group aggregation mechanism based on spatial relationships. The gate_head divides all neighbor nodes into K groups according to their spatial correlation with the central node, and uses gates that integrate category co-occurrence knowledge to control the aggregation of neighbor nodes. Compared with the traditional attention mechanism, the multi-group aggregation mechanism has the following advantages:
(a) Each group only contains the neighbor nodes that have the same spatial correlation with the central node. The neighbor nodes of each group have certain commonalities, and the head does not have parameter redundancy, which can reduce the training burden;
(b) The gate controls the aggregation of neighbor nodes by integrating prior knowledge, which improves the controllability of the network and the interpretability of the results;
(c) Through the gating mechanism, the neighbor nodes are filtered based on the prior knowledge, so the mechanism can mitigate over-smoothing to a certain extent.
Figure 11. The flow chart of aggregation.

Network Depth (Number of Aggregation) and Loss Function
In this section, we focus on the analysis of network depth and loss function in the KSPGAT network.

Depth of KSPGAT Network
Similar to the baseline GAT network, our KSPGAT network aggregates twice in total. In the first aggregation, we adopt the same multi-head structure as the baseline GAT network. Specifically, we use P (in our experiment, P = 2) independent gate_head_p^1 with gating mechanism to process the input node features h and then concatenate their output features as the input feature h^1 of the second aggregation. In the second aggregation, a single gate_head^2 with gating mechanism is used to process the new input feature h^1 to obtain the output feature h^2, and, finally, a softmax function is used to obtain the classification probability o. The formula is as follows:

$$h^{1} = \big\Vert_{p=1}^{P} \mathrm{gate\_head}^{1}_{p}(h), \qquad h^{2} = \mathrm{gate\_head}^{2}(h^{1}), \qquad o = \mathrm{softmax}(h^{2})$$

where h is the original node feature, P is the number of heads in the multi-head structure, and || denotes concatenation. h^1 is the output feature of the first aggregation and also the input feature of the second aggregation, h^2 is the output of the second aggregation, and o is the final classification probability.

Co-Occurrence Knowledge Embedding Loss
To implicitly embed the co-occurrence knowledge into the control gate, we add a co-occurrence knowledge embedding loss to our network in addition to the node classification loss. The co-occurrence knowledge embedding loss, loss_gate, adjusts the network parameters by calculating the mean square error between the value of the co-occurrence knowledge gate and the co-occurrence knowledge GT between nodes; the node classification loss is denoted loss_cls. Furthermore, to balance the node classification loss and the co-occurrence knowledge embedding loss, we introduce a balance factor λ. Generally, λ is the maximum ratio of the two loss thresholds. In the experiment, we set λ = 10. The total loss of the network combines loss_cls and loss_gate through λ.
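A minimal sketch of how the two losses might be combined is given below. Only the mean square error for loss_gate and the value λ = 10 come from the text; using cross-entropy for loss_cls and applying λ to loss_gate (total = loss_cls + λ · loss_gate) are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def gate_loss(gate, gate_gt):
    """Mean square error between gate values and the co-occurrence GT."""
    gate = np.asarray(gate, dtype=np.float64)
    gate_gt = np.asarray(gate_gt, dtype=np.float64)
    return float(np.mean((gate - gate_gt) ** 2))

def classification_loss(probs, labels):
    """Cross-entropy over nodes (assumed form of loss_cls)."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]   # probability of the true class
    return float(-np.mean(np.log(picked + 1e-12)))

def total_loss(probs, labels, gate, gate_gt, lam=10.0):
    """Assumed combination: loss_cls + lam * loss_gate."""
    return classification_loss(probs, labels) + lam * gate_loss(gate, gate_gt)

loss = total_loss(
    probs=[[0.5, 0.5]], labels=[0],
    gate=[[0.5, 0.5]], gate_gt=[[1.0, 0.0]],
)
```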

Experiment
In this chapter, the introduction of research areas and samples is presented in Section 4.1 first. Then, we show the parameters of the networks which are involved in our experiment in Section 4.2. The overall accuracy comparison is conducted in Section 4.3. Finally, the training process and loss curve are shown in Section 4.4.

Introduction of Research Areas and Samples
This study involves Wenchuan County, Sichuan Province, and the surrounding areas. The latitude ranges from 30°45′ to 31°43′ and the longitude ranges from 102°51′ to 103°44′. We selected a total of 1680 patches from the research area. To make the number of samples in the validation set and training set sufficient and the results reasonable, we randomly assigned 1280 samples to the training set and the other 400 samples to the validation set. Each sample includes a 224 × 224 remote sensing image, a manually classified GT image of the same size, and an object mask obtained using open source algorithms [52][53][54].

Network Parameters
The networks that are involved in the experiment include a U-Net network, a baseline GAT network, a multi-source GAT network and a KSPGAT network incorporating prior knowledge and spatial pyramid distance.
Based on previous experiments, when we train the U-Net network, the feature dimension in the bottom layer is set to 512, the batch_size is set to 32, the learning rate is set to 3e−4, and the number of training epochs is 250. The hidden dimensions of the baseline GAT network, multi-source GAT network and KSPGAT network are all 128, and all three networks aggregate twice. The baseline GAT network and the multi-source GAT network use 4 GAT heads for the first aggregation and a single GAT head for the second aggregation; the KSPGAT network uses 2 gate_heads with a gating mechanism in the first aggregation and a single gate_head in the second aggregation. For the three graph neural networks, the batch_size is 1, the learning rate is 1e−3, and the number of training epochs is 600.

Overall Accuracy Comparison
To compare with the results of the U-Net network, we convert the results of the three graph neural networks from objects to pixels. Tables 2-5 show the pixel-based classification confusion matrices of the U-Net network, baseline GAT network, multi-source GAT network, and KSPGAT network on the validation set, respectively. Additionally, we compare the four networks on other classification metrics; the results are given in Table 6. It can be seen that the KSPGAT network also shows a significant improvement in Accuracy, Mean Intersection over Union (MIOU), Kappa, and F1-Score compared with the other three networks.
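For reference, the four metrics can be derived from a pixel-based confusion matrix as sketched below. These are the standard definitions; whether the F1-Score in this paper is macro-averaged over categories is not stated, so the unweighted mean used here is an assumption. The function name and the toy matrix are hypothetical, and the sketch assumes every category occurs at least once.

```python
import numpy as np

def segmentation_metrics(cm):
    """Overall accuracy, mean IoU, Cohen's kappa and macro F1.

    cm: (C, C) confusion matrix of pixel counts; one axis indexes the
    ground truth and the other the prediction (the metrics below do not
    depend on which is which).
    """
    cm = np.asarray(cm, dtype=np.float64)
    total = cm.sum()
    tp = np.diag(cm)
    row = cm.sum(axis=1)
    col = cm.sum(axis=0)
    accuracy = tp.sum() / total
    iou = tp / (row + col - tp)                 # per-category intersection over union
    f1 = 2.0 * tp / (row + col)                 # per-category F1
    expected = (row * col).sum() / total ** 2   # chance agreement for kappa
    kappa = (accuracy - expected) / (1.0 - expected)
    return {
        "accuracy": float(accuracy),
        "miou": float(iou.mean()),
        "kappa": float(kappa),
        "f1": float(f1.mean()),
    }

metrics = segmentation_metrics([[8, 2], [1, 9]])
```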
By comparing the pixel-based classification results of the four networks, it can be seen that the classification results of the baseline GAT network and the U-Net network are similar. The overall accuracy of the U-Net network is 86.7%, and the baseline GAT network is 87.3%. Compared with the baseline GAT network, the multi-source GAT network increases its overall accuracy by 1.3%, reaching 88.6%. The most significant improvement has been made in the KSPGAT network, in which overall accuracy has been increased by 3.8% compared with the baseline GAT network, reaching 91.1%.
Further analysis shows that the classification accuracy of the baseline GAT network in the categories of village, path and forest is significantly higher than that of the U-Net network. However, these two networks have the following problems: (1) The accuracy is low in the categories of city_forest and city_grass, which are prone to be confused with forest and grass; (2) The accuracies of categories with a small number of samples are relatively low. Table 7 shows the accuracy comparison of the four networks in some categories. From this comparison, it can be seen that the classification accuracy of the U-Net network and the baseline GAT network on city_grass and city_forest is relatively low. The classification accuracies of the U-Net network in city_grass and city_forest are 28.7% and 66.9%, respectively, and those of the baseline GAT network are 25.1% and 60.8%, respectively.
The multi-source GAT network shows a slight improvement over the previous two networks, but its classification accuracies for city_grass and city_forest are still low, at only 34.4% and 73.6%, respectively.
Compared with the baseline GAT network, the KSPGAT network, which integrates spatial pyramid distance and co-occurrence prior knowledge, has improved the classification accuracy of city_grass from 25.1% to 61.9%, and the classification accuracy of city_forest has increased from 60.8% to 85.7%.
This comparative analysis shows that, by incorporating the spatial pyramid distance and co-occurrence prior knowledge, the KSPGAT network greatly improves the classification accuracy of city_grass and city_forest.
To verify the stability of our KSPGAT network, we randomly allocated the total samples to the training and validation sets in the same proportions as in the previous experiments and performed 10 independent Monte Carlo runs. We compared the Accuracy, mIOU, Kappa, and F1-Score of the 10 experiments; the trend is shown in Figure 12. Over the 10 experiments, the mean values of the Accuracy, mIOU, Kappa, and F1-Score were 0.9095, 0.8449, 0.9151, and 0.9138, respectively, and the standard deviations were 0.00259, 0.00287, 0.00290, and 0.00169, respectively, which indicates that the experimental results are stable and reliable.
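A summary of this kind can be computed as in the following minimal sketch. The per-run values are hypothetical placeholders, not the measured results, and the text does not state whether the sample or the population standard deviation was reported; the sketch assumes the sample standard deviation.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) of one metric over repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical overall-accuracy values for 10 runs (placeholders, not measured data).
accuracy_runs = [0.910, 0.908, 0.912, 0.905, 0.911, 0.909, 0.913, 0.907, 0.910, 0.906]
mean_acc, std_acc = summarize_runs(accuracy_runs)
print(f"accuracy: mean={mean_acc:.4f}, std={std_acc:.5f}")
```

The same function would be applied separately to the mIOU, Kappa, and F1-Score values of the runs.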
In addition, to compare the performance and system resource requirements of the four networks, we conducted a benchmark test. On an RTX 3080 GPU, we used the same validation set to measure Params (the model size), Mem (the GPU memory consumption during training), FLOPs (the amount of computation), and Inf time (the inference speed of the model). We ran the benchmark 10 times in this environment and took the average as the final result. The benchmark results are shown in Table 8.
Further, to check the value of our method, we used another remote sensing semantic segmentation dataset, the Gaofen Image Dataset (GID) [55], for experimental comparison. The dataset contains 10 pixel-level annotated GF-2 images and is made up of 15 categories, two more than our previous dataset: paddy field, irrigated land, dry cropland, garden land, arbor forest, shrub land, natural meadow, artificial meadow, industrial land, urban residential, rural residential, traffic land, river, lake, and pond. Since the 10 images in the GID dataset come from different regions and cover a geographic area of 506 km², to make a fair comparison with our previous experiment, we cut 1800 samples with a size of 224 × 224 to compose a new dataset whose size is similar to that of our previous dataset, and we randomly allocated the training set and the validation set in a ratio of 7:3, consistent with our previous experiment. We thus obtained 1260 samples as training data and the remaining 540 samples as validation data. Then, we recalculated the category co-occurrence probability in the new dataset. Finally, we trained the four models on the training set with the same hyperparameters (batch size, learning rate, and training epochs) as in the previous experiments and evaluated them on the validation set. The results are shown in Table 9.
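The 7:3 random allocation described above can be sketched as follows. The function name, the use of integer sample identifiers, and the fixed seed are assumptions made for illustration; the text does not describe how the split was implemented.

```python
import random


def split_samples(sample_ids, train_ratio=0.7, seed=0):
    """Shuffle the sample ids and split them into training and validation lists."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    rng.shuffle(ids)
    n_train = round(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]


train_ids, val_ids = split_samples(range(1800))
print(len(train_ids), len(val_ids))  # 1260 540
```

Changing `seed` gives a different random allocation with the same proportions, which is how repeated Monte Carlo runs of the kind described earlier could be generated.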
It can be seen that our KSPGAT model improves the overall accuracy by 3.1% compared with the baseline GAT network and by 4.5% compared with the U-Net network. These results suggest that, for different regions and seasons and for different satellite data with more categories, the performance of our model remains stable and reliable, which illustrates the value of our model.

Training Process and Loss Curve
All four networks are trained with the Adam optimizer, and their loss curves on the validation set are shown in Figure 13. In Figure 13, the x-axis represents the number of training epochs, and the y-axis represents the loss during network training. The loss curves show the convergence trend of the networks: in the early training period the curves oscillate and decrease, and they stabilize towards the later stage. We trained the four networks for 600 epochs, but the U-Net network overfits after 250 epochs. To ensure complete convergence without overfitting, we chose 250 training epochs for the U-Net network and 600 for the baseline GAT network, the multi-source GAT network, and the KSPGAT network.
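One simple way to locate the point after which a network starts to overfit is to find the epoch with the lowest validation loss. The sketch below illustrates this idea on synthetic loss values; it is not necessarily the procedure the authors used to arrive at 250 epochs, and the function name is hypothetical.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch index with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Synthetic validation losses: decreasing at first, then rising again (overfitting).
synthetic_losses = [1.00, 0.70, 0.50, 0.42, 0.40, 0.43, 0.47, 0.55]
print(best_epoch(synthetic_losses))  # 5
```

On a noisy curve, the losses would usually be smoothed (for example with a moving average) before taking the minimum.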

Results
According to the two problems discussed in Section 1, we analyze the recognition capabilities of the three object-based networks on each problem separately: the problem of "different objects with the same spectrum" is analyzed in Section 5.3, and the problem of "violating the first law of geography" is analyzed in Section 5.4. Finally, the three networks are discussed in Section 5.5.
To show the advantages of the KSPGAT network, we analyze the classification results of the four networks on some typical samples, as shown in Figure 14. Comparing the classification results of the three object-based models, it can be found that samples I and II are classified completely correctly by all three networks. Samples III and IV cannot be classified correctly by the baseline GAT network, but are classified completely correctly by both the multi-source GAT network and the KSPGAT network. Samples V and VI are classified completely correctly only by the KSPGAT network.
Furthermore, the classification results of the three object-based graph neural networks are significantly better than those of the U-Net network. The U-Net network performs segmentation pixel by pixel, which leads to salt-and-pepper effects, whereas the three graph neural networks perform segmentation based on superpixel blocks and are free from salt-and-pepper effects.
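The absence of salt-and-pepper noise in the object-based results follows from how labels are written back to the image: every pixel inherits the label predicted for its superpixel. The following minimal sketch assumes a segment map that stores a superpixel id for each pixel and a dictionary of predicted node labels; both names are hypothetical.

```python
def paint_superpixels(segment_map, node_labels):
    """Assign to every pixel the label predicted for its superpixel (graph node)."""
    return [[node_labels[segment_id] for segment_id in row] for row in segment_map]


# Toy 3 x 4 image split into two superpixels (ids 0 and 1).
segment_map = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
node_labels = {0: "forest", 1: "flat_field"}
label_map = paint_superpixels(segment_map, node_labels)
for row in label_map:
    print(row)
```

Because a label can change only at a superpixel boundary, isolated mislabeled pixels inside an object cannot occur, although a misclassified node mislabels its whole superpixel.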
To check the value of our method on another dataset, we show the classification results of the four models on GID [55] in Figure 15.
Remote Sens. 2021, 13, x FOR PEER REVIEW
Figure 15. The classification results of the four networks in the GID dataset.
As shown in Figure 15, our KSPGAT model also outperforms the other models on the GID dataset, which supports the value of our method in other areas.
To analyze the advantages of the KSPGAT network during node aggregation, according to the two problems discussed in Section 1, we divide the samples in Figure 14 into two groups: one is disturbed by the problem of "different objects with the same spectrum", and the other is affected by the problem of "violating the first law of geography".

The Problem of "Different Objects with the Same Spectrum" in Samples III and IV
The phenomenon of "different objects with the same spectrum" is a common problem. Correctly identifying the objects disturbed by this problem often requires knowledge of the surrounding objects. Figure 16 shows the samples with this problem. As seen in Figure 16, the spectra of town object A and flat_field object B in (a) are similar, and the spectra of flat_field object C and city_grass object D in (c) are similar. Therefore, they all have the problem of "different objects with the same spectrum". Without information about the surrounding environment, it is hard to identify their actual categories.

The Problem of "Violating the First Law of Geography" in Samples V and VI
Due to the limitations of sample cutting, the neighbors of some nodes in a sample may be incomplete or inaccurate in some areas, which distorts the spatial relationships. Figure 17 shows the samples with this problem. As seen, the forest object A in Figure 17a and the city_forest object B in Figure 17c are cut at the corner of the sample. As their spatial relationships with their neighbors are distorted, they may be affected by the problem of "violating the first law of geography".
From these two groups of samples, we choose the flat_field₁ object in sample III and the forest₁ object in sample V as the analysis targets. Among them: (a) The flat_field₁ object in sample III is misclassified as town by the baseline GAT due to the problem of "different objects with the same spectrum". However, the multi-source GAT and the KSPGAT, which both consider the spatial relationship, classify it correctly.
(b) The forest₁ object in sample V is misclassified as city_forest by the baseline GAT and the multi-source GAT due to the problem of "violating the first law of geography". It is classified correctly only by the KSPGAT, which incorporates the category co-occurrence knowledge.
The following subsections present the specific analysis.

Analysis of the Problem "Different Objects with the Same Spectrum"
In this section, we analyze the attention results of the node flat_field₁ in sample III to compare the recognition capabilities of the three object-based models on samples that contain objects disturbed by the problem of "different objects with the same spectrum". Figure 18 shows the classification results and the object masks in sample III. The attention results of the node flat_field₁ in sample III are shown in Table 10. From the attention results of the baseline GAT in sample III, it can be seen that the misclassified flat_field₁, besides itself, mainly focuses on town₈, with an attention weight of 0.89. The attention weights of the other nodes are relatively small, and that of flat_field₂ is especially small, at only 0.25. This shows that the spectrum of flat_field₁ is similar to that of town₈. Therefore, when aggregating neighbor nodes, flat_field₁ mainly considers the information of town₈ and gives little consideration to flat_field₂, which causes it to be misclassified as town by the baseline GAT network.

Analysis of the Multi-Source GAT Network
Compared with the baseline GAT network, the multi-source GAT network shows its advantages. After the spatial correlation is considered, the attention weights of flat_field₂ and water_body₃, whose spatial pyramid distance from flat_field₁ is 1, are increased. The weight of flat_field₂ rises from 0.25 to 0.67, and its relative ranking rises from sixth to second, while the attention weights of the other nodes are reduced to varying degrees. Among them, the weight of town₈, whose spatial pyramid distance from flat_field₁ is 3, is reduced from 0.89 to 0.32, and its relative ranking drops from second to fourth. Therefore, besides itself, flat_field₁ mainly focuses on the category of flat_field, and it is finally classified correctly by the multi-source GAT network.

Analysis of the KSPGAT Network
In this section, we first analyze the control gates of flat_field₁ and then investigate the final aggregation weights in the KSPGAT network. The control gates and aggregation weights of flat_field₁ are shown in Table 11. Analyzing the values of the three control gates of flat_field₁, it can be seen that among the neighbor nodes: (a) The nodes with a spatial pyramid distance of 1 from flat_field₁ are flat_field₂ and water_body₃. According to the category co-occurrence knowledge, the co-occurrence probability between flat_field₂ and flat_field₁ is 1, so the two nodes have a category co-occurrence relationship. Correspondingly, the control gate value of flat_field₂ is 0.95, which means that the control gate is open. In contrast, the co-occurrence probability between water_body₃ and flat_field₁ is smaller than 0.1, so its control gate value is only 0.09, which means that the control gate is closed.
(b) The only node with a spatial pyramid distance of 2 from flat_field₁ is city_forest₄. According to the category co-occurrence knowledge, the co-occurrence probability between city_forest₄ and flat_field₁ is smaller than 0.1, and the two nodes do not have a category co-occurrence relationship, so the control gate value of city_forest₄ is 0.14, which also means that the control gate is closed.
(c) The nodes with a spatial pyramid distance of 3 from flat_field₁ are road₅, city_forest₆, city_forest₇ and town₈. Since there is no category co-occurrence relationship between them and flat_field₁, their control gates are closed.
Next, we analyze the aggregation weights of flat_field₁ in the KSPGAT network, which combines multi-source attention and the gating mechanism. It can be seen that the aggregations of water_body₃ and town₈ are directly closed through the gating mechanism, while the aggregation of flat_field₂ is open. The attention weight of flat_field₂ is 0.65, while those of the other neighbor nodes are all smaller than 0.1. Therefore, the relative order of flat_field₂'s attention weight is changed through the gating mechanism. At this point, apart from flat_field₁ itself, only flat_field₂ has a large attention weight. The aggregation of the other nodes, which do not have a category co-occurrence relationship with flat_field₁, is completely suppressed, so flat_field₁ can be classified correctly.
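The effect of the gate on aggregation can be illustrated with a simplified sketch. It assumes a hard 0/1 gate (the gates reported above are soft values such as 0.95 and 0.09) and that the gated weight is the product of the gate and the attention weight. The exact formulation used by the KSPGAT may differ, so the sketch does not reproduce the values in Table 11. The attention weights of flat_field₂ and town₈ are the multi-source GAT values quoted in the text, and the function name is hypothetical.

```python
def apply_gates(attention, gate_open):
    """Multiply each neighbor's attention weight by a hard 0/1 gate (a simplification)."""
    return {
        node: weight * (1.0 if gate_open[node] else 0.0)
        for node, weight in attention.items()
    }


# Multi-source attention weights of two neighbors of flat_field_1, as quoted in the text.
attention = {"flat_field_2": 0.67, "town_8": 0.32}
# Gate state according to the category co-occurrence relationship with flat_field_1.
gate_open = {"flat_field_2": True, "town_8": False}

print(apply_gates(attention, gate_open))  # {'flat_field_2': 0.67, 'town_8': 0.0}
```

With the gate closed, town₈ no longer contributes to the aggregation regardless of its attention weight, which is the behavior described for the nodes without a co-occurrence relationship.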

Analysis of the Problem "Violating the First Law of Geography"
In this section, we analyze the attention results of the node forest₁ in sample V to compare the recognition capabilities of the three object-based models on samples affected by the problem of "violating the first law of geography". Figure 19 shows the classification results and the object masks in sample V.
Figure 19. The classification results and the object masks in sample V. The yellow box marks the forest₁ object analyzed in this section, and the red box marks the misclassified result.
The attention results of the node forest₁ in sample V are shown in Table 12. From the attention results of the baseline GAT in sample V, it can be seen that the misclassified forest₁, besides itself, mainly focuses on city_forest₅, city_forest₁₁, city_forest₇, and city_forest₁₄, with attention weights of 0.43, 0.41, 0.39, and 0.39, respectively. The attention weights of the other nodes are relatively small, and the attention weight of flat_field₂ is only 0.22.
Further considering the accumulated weights of objects of the same category, the results are shown in Table 13. For sample V, due to the problem of "violating the first law of geography", the accumulated weight of the city_forest objects is 1.62 (0.43 + 0.41 + 0.39 + 0.39), which accounts for the majority compared with the other categories, thereby causing forest₁ to be misclassified.

Analysis of the Multi-Source GAT Network
From the attention results of the multi-source GAT in sample V, it can be seen that after the spatial correlation is considered, the attention weights of flat_field₂, road₃, and water_body₄, whose spatial pyramid distance from forest₁ is 1, are raised. The attention weight of flat_field₂ rises from 0.22 to 0.49, and its relative ranking rises from sixth to second, while the attention weights of the other nodes are reduced to varying degrees. Among them, the attention weights of the four city_forest objects, which are far from forest₁, decrease greatly.
However, due to the problem of "violating the first law of geography", a large number of distant objects of the same category may accumulate a large effect on the central node. Table 14 shows the accumulated weights of objects of the same category in sample V. The four city_forest objects are far away from forest₁, and the weight of each individual object is reduced. However, because there are many of them, the accumulated weight of the city_forest objects is 1.06 (0.36 + 0.28 + 0.22 + 0.20), which still accounts for the majority compared with the other categories, so forest₁ is still misclassified.
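The accumulation effect can be checked with a few lines of code. The sketch below sums the attention weights per category using only the weights quoted in the text for the city_forest objects and for flat_field₂; the remaining neighbors of forest₁ are omitted because their weights are not given here, and the function name is hypothetical.

```python
from collections import defaultdict


def accumulate_by_category(weighted_neighbors):
    """Sum attention weights over neighbors that share a category."""
    totals = defaultdict(float)
    for category, weight in weighted_neighbors:
        totals[category] += weight
    return dict(totals)


# Attention weights of forest_1's neighbors as quoted in the text (other neighbors omitted).
baseline_gat = [("city_forest", 0.43), ("city_forest", 0.41),
                ("city_forest", 0.39), ("city_forest", 0.39), ("flat_field", 0.22)]
multi_source_gat = [("city_forest", 0.36), ("city_forest", 0.28),
                    ("city_forest", 0.22), ("city_forest", 0.20), ("flat_field", 0.49)]

for name, neighbors in [("baseline GAT", baseline_gat), ("multi-source GAT", multi_source_gat)]:
    totals = accumulate_by_category(neighbors)
    print(name, {category: round(total, 2) for category, total in totals.items()})
```

After rounding, the city_forest totals are 1.62 and 1.06, so in both networks the accumulated city_forest weight exceeds the weight of flat_field₂ even though each individual city_forest weight in the multi-source GAT is below 0.49.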

Analysis of the KSPGAT Network
Compared with the multi-source GAT network, the KSPGAT network shows its advantages in sample V. As before, we first analyze the control gates of forest₁ and then investigate the final attention weight in the KSPGAT network. The control gates and aggregation weights of forest₁ are shown in Table 15. Analyzing the values of the three control gates of forest₁, it can be seen that among the neighbor nodes: (a) The nodes with a spatial pyramid distance of 1 from forest₁ are flat_field₂, road₃ and water_body₄. According to the category co-occurrence knowledge, the co-occurrence probability between flat_field₂ and forest₁ is 0.69, so the two nodes have a category co-occurrence relationship. Correspondingly, the control gate value of flat_field₂ is 0.98, which indicates that the control gate is open. The other two nodes do not have a category co-occurrence relationship with forest₁, so their control gates are closed.
(b) The only node with a spatial pyramid distance of 2 from forest₁ is city_forest₅. According to the category co-occurrence knowledge, the co-occurrence probability between them is smaller than 0.1, which means that they do not have a category co-occurrence relationship, so the control gate is also closed.
(c) The remaining nodes are all at a spatial pyramid distance of 3 from forest₁. Since there is no category co-occurrence relationship between them and forest₁, their control gates are all closed.
Next, we analyze the aggregation weights of forest₁ in the KSPGAT network. Only the aggregation of flat_field₂ is open, since it has a category co-occurrence relationship with forest₁, while the aggregations of the other nodes are closed. The attention weight of flat_field₂ is 0.49, while those of the other neighbor nodes are all smaller than 0.1. Therefore, the relative order of the attention weights of forest₁'s neighbors is changed by the gating mechanism.
Considering the accumulated weights of objects of the same category in sample V, the results are shown in Table 16. At this point, apart from the central node itself, only flat_field₂ has a large attention weight, so forest₁ can be classified correctly.

Discussion
From the analysis in Sections 5.3 and 5.4, the following conclusions can be drawn: (1) The baseline GAT network relies more on feature similarity and is easily disturbed by the problem of "different objects with the same spectrum", while the multi-source GAT network, which also considers spatial correlation, can strengthen or weaken the attention weights of neighbor nodes according to their distance from the central node, thereby solving the problem of "different objects with the same spectrum" to a certain extent.
(2) However, due to the distortion of the spatial relationships of objects that are cut at the corners of a sample, the multi-source GAT network, which considers only spectral similarity and spatial correlation, may be affected by the problem of "violating the first law of geography", so the central node may still be misclassified.
(3) The KSPGAT, which takes into account the category co-occurrence prior knowledge obtained from the whole research area, can expand the receptive field of the objects from the specific sample to the whole research area through the gating mechanism. The KSPGAT network can therefore correct the problem of "violating the first law of geography" to a certain extent, so that the central node can be classified correctly.
(4) The KSPGAT network can control the aggregation of neighbor nodes through the gating mechanism based on the geographic prior knowledge (co-occurrence probability), thereby avoiding the problem of over-smoothing to a certain extent.

Conclusions
In this paper, a novel remote sensing semantic segmentation model is proposed to better recognize the objects disturbed by the problem of "different objects with the same spectrum" and to effectively alleviate the problem of "violating the first law of geography". The model integrates the similarity of geographic objects, the spatial pyramid distance, and global geographic prior knowledge, and it uses a gating mechanism to control the process of node aggregation through prior knowledge, thereby embedding the higher-level semantic knowledge of geographic objects into the remote sensing image semantic segmentation network. Furthermore, it can avoid the problem of over-smoothing to a certain extent. The experimental results show that our model improves the overall accuracy by 3.8% compared with the baseline GAT network.
Our future work will focus on the following directions: (1) The selection of segmentation scale. Different types of geographic objects have different segmentation scales in remote sensing images, and how to balance them is still worth exploring.
(2) A suitable way to judge whether two categories have a co-occurrence relationship. In our method, if $\max(P_{ij}, P_{ji}) \geq 0.5$, category i and category j are considered to have a co-occurrence relationship. How to choose the threshold in a suitable way requires more attempts.
(3) Automatic acquisition of prior knowledge. In this paper, the prior knowledge is based on manual statistics and analysis. This approach is affected by subjective factors and is not efficient. To improve the acquisition of prior knowledge, we plan to develop an automatic learning approach in our future research.
(4) Application of the method to other research. To verify the universality of the method, we will further improve it and try to apply it to other research, such as land use change [56] and analysis of water resources [57].
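As an illustration of the thresholding rule in direction (2), and of how the statistics mentioned in direction (3) might be gathered automatically, the following sketch estimates co-occurrence probabilities from the set of categories present in each sample and applies the 0.5 threshold. It assumes that P_ij is the fraction of samples containing category i that also contain category j; this definition, the function names, and the toy samples are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def cooccurrence_probabilities(samples):
    """Estimate P[(i, j)]: share of samples containing i that also contain j (assumed definition)."""
    contains = Counter()
    together = Counter()
    for categories in samples:
        present = set(categories)
        contains.update(present)
        # Ordered pairs, including (i, i), so that P[(i, i)] is 1 for every category.
        together.update(product(present, repeat=2))
    return {pair: count / contains[pair[0]] for pair, count in together.items()}


def has_cooccurrence(probabilities, i, j, threshold=0.5):
    """Apply the rule max(P_ij, P_ji) >= threshold."""
    p_ij = probabilities.get((i, j), 0.0)
    p_ji = probabilities.get((j, i), 0.0)
    return max(p_ij, p_ji) >= threshold


# Toy samples: the set of categories present in each image tile (hypothetical).
samples = [
    {"forest", "flat_field"},
    {"forest", "flat_field", "water_body"},
    {"town", "city_forest"},
    {"town", "city_forest", "road"},
    {"forest", "town"},
]
probabilities = cooccurrence_probabilities(samples)
print(has_cooccurrence(probabilities, "forest", "flat_field"))
print(has_cooccurrence(probabilities, "forest", "city_forest"))
```

Taking the maximum of the two directions makes the rule symmetric: a rare category that almost always appears next to a common one is still treated as co-occurring with it, even when the reverse probability is low.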