Multi-Label Remote Sensing Image Scene Classification by Combining a Convolutional Neural Network and a Graph Neural Network

Abstract: As one of the fundamental tasks in remote sensing (RS) image understanding, multi-label remote sensing image scene classification (MLRSSC) is attracting increasing research interest. Human beings can easily perform MLRSSC by examining the visual elements contained in the scene and the spatio-topological relationships of these visual elements. However, most existing methods are limited to perceiving visual elements while disregarding their spatio-topological relationships. With this consideration, this paper proposes a novel deep learning-based MLRSSC framework that combines a convolutional neural network (CNN) and a graph neural network (GNN), termed MLRSSC-CNN-GNN. Specifically, the CNN is employed to learn to perceive the visual elements in the scene and to generate high-level appearance features. Based on the trained CNN, a scene graph is further constructed for each scene, where the nodes of the graph are represented by superpixel regions of the scene. To fully mine the spatio-topological relationships of the scene graph, the multi-layer-integration graph attention network (GAT) model is proposed to address MLRSSC, where the GAT is one of the latest developments in GNNs. Extensive experiments on two public MLRSSC datasets show that the proposed MLRSSC-CNN-GNN obtains superior performance compared with the state-of-the-art methods.


Introduction
Single-label remote sensing (RS) image scene classification considers the image scene (i.e., one image block) as the basic interpretation unit and aims to assign one semantic category to the RS image scene according to its visual and contextual content [1][2][3]. Due to its extensive applications in object detection [4][5][6][7], image retrieval [8][9][10], etc., single-label RS image scene classification has attracted extensive attention. To address single-label RS classification, many excellent algorithms have been proposed [11][12][13][14]. At present, single-label RS scene classification has reached saturation accuracy [15]. However, one single label is often insufficient to fully describe the content of a real-world image.
Compared with single-label RS image scene classification, multi-label remote sensing image scene classification (MLRSSC) is a more realistic task. MLRSSC aims to predict multiple semantic labels to describe an RS image scene. Because of its stronger descriptive ability, MLRSSC can be applied in many fields, such as image annotation [15,16] and image retrieval [17,18]. MLRSSC is also a more challenging task. The main contributions of this paper are summarized as follows:

• We propose a novel MLRSSC-CNN-GNN framework that can simultaneously mine the appearances of visual elements in the scene and the spatio-topological relationships among them. The experimental results on two public datasets demonstrate the effectiveness of our framework.
• We design a multi-layer-integration GAT model to mine the spatio-topological relationships of the RS image scene. Compared with the standard GAT, the recommended multi-layer-integration GAT benefits from fusing multiple intermediate topological representations and can further improve the classification performance.
The remainder of this paper is organized as follows: Section 2 reviews the related works. Section 3 introduces the details of our proposed framework. Section 4 describes the setup of the experiments and reports the experimental results. Section 5 discusses the important factors of our framework. Section 6 presents the conclusions of this paper.

Related Work
In this section, we discuss the related work from two aspects: MLRSSC and GNN-based applications.

MLRSSC
In early research on MLRSSC, handcrafted features were often employed to describe image scenes [44,45]. However, handcrafted features have limited generalization ability and cannot achieve an optimal balance between discriminability and robustness. Recently, deep learning methods have achieved impressive results in MLRSSC [32,46]. For instance, the standard CNN method can complete feature extraction and classification end-to-end within a deep network framework. Moreover, Zeggada et al. designed a multi-label classification layer to address multi-label classification via a customized threshold operation [33]. To exploit the co-occurrence dependency of multiple labels, Hua et al. combined a CNN and an RNN to sequentially predict labels [34]. However, due to the accumulation of misclassification information during the generation of label sequences, the use of the RNN may cause an error propagation problem [47]. Hua et al. also considered the label dependency and proposed a relation network for MLRSSC using the attention mechanism [48]. These methods are limited to considering visual elements in the image scene while disregarding the spatio-topological relationships of the visual elements. In addition, Kang et al. proposed a graph relation network to model the relationships between image scenes for MLRSSC [49]. However, it mainly focused on leveraging the relationships between image scenes and still did not model the spatial relationships between visual elements within each image scene.

GNN-Based Applications
The GNN is a novel model with great potential to extend the ability of deep learning to process non-Euclidean data. GNNs are extensively applied in fields such as social networks [50], recommender systems [51], and knowledge graphs [52]. In recent years, some GNNs, such as the GCN, have been employed to solve image understanding problems. Yang et al. constructed scene graphs for images and completed image captioning via the GCN [53]. Chaudhuri et al. used a Siamese GCN to assess the similarity of scene graphs for image retrieval [54]. Chen et al. proposed a GCN-based multi-label natural image classification model, where the GCN is employed to learn the label dependency [43]. However, the GCN is limited in exploring complex node relationships because it only uses a fixed or learnable polynomial of the adjacency matrix to aggregate node features. Compared with the GCN, the GAT is a more advanced model that learns the aggregation weights of nodes using the attention mechanism. This adaptability makes the GAT more effective at fusing information from graph topological structures and node features [55]. However, due to the difference between image data and graph-structured data, mining the spatio-topological relationships of images via the GAT remains a problem worth exploring.

Method
To facilitate understanding, our proposed MLRSSC-CNN-GNN framework is visually shown in Figure 1. Generally, we propose a way to map an image into graph-structured data and transform the MLRSSC task into the graph classification task. Specifically, we consider the superpixel regions of the image scene as the nodes of the graph to construct the scene graph, where the node features are represented by the deep feature maps from the CNN. According to the proximity and similarity between superpixel regions, we define the adjacency of nodes, which can be easily employed by the GNN to optimize feature learning. With the scene graph as input, the multi-layer-integration GAT is designed to complete multi-label classification by fusing information from the node features and spatio-topological relationships of the graph.


Using CNN to Generate Appearance Features
Generating visual representations of the image scene is crucial in our framework. In particular, we use the CNN as a feature extractor to obtain deep feature maps from intermediate convolutional layers as the representations of high-level appearance features. To improve the perception ability of the CNN and make it effective in the RS image, we retrain the CNN by transfer learning [56].
Considering Θ as the parameters of the convolutional layers and Φ as the parameters of the fully connected layers, the loss function during the training phase can be represented by Equation (1):

L(Θ, Φ) = −∑_{c=1}^{C} [ y^(c) log f_CNN(I)^(c) + (1 − y^(c)) log(1 − f_CNN(I)^(c)) ]   (1)

where f_CNN(·) represents the nonlinear mapping of the whole CNN network, I indicates an RS image, y^(c) indicates the ground-truth binary label of class c, and C is the number of categories. The process of feature extraction can be represented by Equation (2):

M = f_FR(I; Θ)   (2)

where f_FR(·) represents the feature representation process of the trained CNN, and M indicates the deep feature maps of image I. Note that the CNN could also be trained from scratch on the RS image dataset; however, because the experimental datasets are small, we fine-tune the weights of the deep convolutional layers so that the model converges quickly.
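As a concrete illustration of this feature-extraction step, the following sketch (assuming PyTorch and torchvision, which the paper does not specify) pulls intermediate feature maps from a pretrained VGG16 via forward hooks and fuses them into the deep feature maps M of Equation (2). The hook indices are the torchvision equivalents of the Keras-style layer names "block4_conv3" and "block5_conv3" quoted in the experimental settings.

```python
# Minimal sketch (PyTorch/torchvision assumed) of extracting the deep feature
# maps M (Equation (2)) from a VGG16 backbone; fine-tuning is omitted here.
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg16 = models.vgg16(weights="IMAGENET1K_V1")   # ImageNet initialization [64]
vgg16.eval()

# In torchvision's VGG16, features[21] and features[28] are the conv4_3 and
# conv5_3 convolutions, i.e., Keras' "block4_conv3" and "block5_conv3".
feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

vgg16.features[21].register_forward_hook(save_output("block4_conv3"))
vgg16.features[28].register_forward_hook(save_output("block5_conv3"))

image = torch.randn(1, 3, 256, 256)             # one UCM-sized RS image tensor
with torch.no_grad():
    vgg16(image)                                 # forward pass fills feature_maps

# Upsample both maps to the original image size and concatenate along channels:
# 512 + 512 = 1024 channels, the initial node-feature dimension D.
m4 = F.interpolate(feature_maps["block4_conv3"], size=(256, 256), mode="bilinear", align_corners=False)
m5 = F.interpolate(feature_maps["block5_conv3"], size=(256, 256), mode="bilinear", align_corners=False)
M = torch.cat([m4, m5], dim=1)                   # deep feature maps M: (1, 1024, 256, 256)
```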

Constructing Scene Graph
We construct a scene graph for each image to map the image into graph-structured data. Graph-structured data are mainly composed of the node feature matrix X ∈ R^(N×D) and the adjacency matrix A ∈ R^(N×N), where N is the number of nodes and D is the dimension of the features. In our framework, X is constructed from the appearance features produced by the CNN, and A is constructed according to the topological structure of the superpixel regions.
We use the simple linear iterative clustering (SLIC) superpixel algorithm [57] to segment the image into N nonoverlapping regions that represent the nodes of the graph. SLIC is an unsupervised image segmentation method that locally clusters image pixels via k-means and generates compact, nearly uniform superpixels. Since each superpixel consists of homogeneous pixels, it can be regarded as an approximate representation of a local visual element. We apply the high-level appearance features as the initial node features to construct X. Specifically, we combine the deep feature maps M with the segmentation results by upsampling M to the size of the original image. To capture the main visual features, we take the maximum value of each feature-map slice within each superpixel region boundary as the corresponding node feature. This extraction is repeated for every slice of M, so the multiple channels of M yield multidimensional node features.
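A minimal sketch of this node-feature construction follows, assuming NumPy and scikit-image (whose slic function implements [57]); the function name node_features and the choice n_segments=60 are illustrative, not values fixed by the paper.

```python
# Sketch: SLIC superpixels plus per-region max pooling over the upsampled
# feature maps M to build the node feature matrix X (NumPy/scikit-image assumed).
import numpy as np
from skimage.segmentation import slic

def node_features(image_rgb, M, n_segments=60):
    """image_rgb: (H, W, 3) float array in [0, 1]; M: (D, H, W) upsampled maps."""
    segments = slic(image_rgb, n_segments=n_segments, start_label=0)
    n_nodes = segments.max() + 1
    X = np.zeros((n_nodes, M.shape[0]), dtype=np.float32)
    for r in range(n_nodes):
        mask = segments == r
        X[r] = M[:, mask].max(axis=1)   # max of each feature-map slice within region r
    return X, segments
```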
We construct A considering the proximity and similarity between superpixel regions. We measure the spatial proximity of nodes by the adjacency of the superpixel regions and quantify the similarity of nodes by calculating the distance between superpixel regions in a color space that matches human perception. In addition, we use a threshold on the color distance to filter noisy links. When regions i and j share a common boundary, the adjacency value A_ij is defined by Equation (3):

A_ij = 1 if ‖v_i − v_j‖ ≤ t, and A_ij = 0 otherwise   (3)

where v_i and v_j represent the mean values of regions i and j in the HSV color space, and the threshold t is empirically set to 0.2 according to the common color aberration of different categories. Note that A is a symmetric binary matrix with self-loops that only defines whether nodes are connected; the specific adjacency weights are adaptively learned in the GNN module to represent the relationships among nodes. The detailed process of constructing the scene graph is shown in Algorithm 1.
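The adjacency construction of Equation (3) might then be sketched as follows; the Euclidean norm is one plausible reading of the color distance, which the paper does not make explicit, and the neighbor scan over pixel pairs is our implementation choice.

```python
# Sketch of Equation (3): a symmetric binary adjacency with self-loops that
# links spatially adjacent superpixels whose mean HSV colors are within t = 0.2.
import numpy as np
from skimage.color import rgb2hsv

def build_adjacency(image_rgb, segments, t=0.2):
    hsv = rgb2hsv(image_rgb)
    n_nodes = segments.max() + 1
    # Mean HSV color v_i of each superpixel region.
    v = np.stack([hsv[segments == r].mean(axis=0) for r in range(n_nodes)])
    A = np.eye(n_nodes, dtype=np.float32)                  # self-loops: A_rr = 1
    # Regions sharing a boundary: inspect horizontal and vertical pixel neighbors.
    pairs = set()
    for a, b in ((segments[:, :-1], segments[:, 1:]), (segments[:-1, :], segments[1:, :])):
        for s1, s2 in zip(a.ravel(), b.ravel()):
            if s1 != s2:
                pairs.add((min(s1, s2), max(s1, s2)))
    for i, j in pairs:
        if np.linalg.norm(v[i] - v[j]) <= t:               # color-distance filter
            A[i, j] = A[j, i] = 1.0
    return A
```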

Learning GNN to Mine Spatio-Topological Relationship
Benefiting from the mechanism of node message passing, the GNN can integrate the spatio-topological structure into node feature learning. Thus, we treat the MLRSSC task as the graph classification task to mine the spatial relationships of the scene graph via the GNN. For graph classification, the GNN is composed of graph convolution layers, graph pooling layers and fully connected layers. Specifically, we adopt the GAT model [40] as the backbone of the graph convolution layer and design the multi-layer-integration GAT structure to better learn the complex spatial relationship and topological representations of the graph.

Algorithm 1 Algorithm to construct the scene graph of an RS image
Input: RS image I.
Output: Node feature matrix X and adjacency matrix A.
1: for each I do
2:   Extract deep feature maps M from image I;
3:   Segment I into N superpixel regions R;
4:   for each r ∈ R do
5:     Obtain the max values of M within the boundary of r in the D channels, and update the vector X_r ∈ R^D of the matrix X;
6:     Calculate the mean value v_r of r in the HSV color space;
7:     Obtain the list R' of regions adjacent to r;
8:   end for
9:   for each r ∈ R do
10:    A_rr = 1;
11:    Calculate the color distance ‖v_r − v_r'‖ between r and each r' ∈ R';
12:    if ‖v_r − v_r'‖ ≤ t then
13:      A_rr' = 1;
14:    end if
15:  end for
16: end for

Graph Attention Convolution Layer
We construct the graph convolution layer following the GAT model to continually update the node features and adjacency weights. With the attention mechanism, the adjacency weights are adaptively learned from the node features, so they can represent the complex relationships among nodes. Considering X_i ∈ R^D as the features of node i, the attention weight e_ij between node i and node j is calculated with a learnable linear transformation, which can be represented by Equation (4):

e_ij = H^T [W X_i ∥ W X_j]   (4)

where ∥ is the concatenation operation, W ∈ R^(D'×D) and H ∈ R^(2D') are the learnable parameters, and D' indicates the dimension of the output features. The topological structure is injected into the mechanism by a mask operation: only e_ij for nodes j ∈ η_i are employed in the network, where η_i is the neighborhood of node i, generated according to A. Subsequently, e is nonlinearly activated via the LeakyReLU function and normalized by Equation (5):

α_ij = exp(LeakyReLU(e_ij)) / ∑_{k∈η_i} exp(LeakyReLU(e_ik))   (5)

We can fuse information from the graph topological structure and the node features by matrix multiplication between α and X. In addition, we adopt multi-head attention to stabilize the learning process. Considering X_in ∈ R^(N×D) as the input node features, the output node features X_GAT ∈ R^(N×KD') of a graph attention convolution layer can be computed by Equation (6):

X_GAT = ∥_{k=1}^{K} α^(k) X_in (W^(k))^T   (6)

where ∥ represents the concatenation of the output node features from the K independent attention mechanisms, α^(k) is the normalized attention matrix of the k-th attention mechanism, and W^(k) is the corresponding weight matrix.
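A compact single-head version of this layer can be sketched as follows (PyTorch assumed; GraphAttentionLayer is our illustrative name). The multi-head layer of Equation (6) concatenates K independent copies of it.

```python
# Sketch of Equations (4)-(5): masked attention over adjacent nodes, then
# feature aggregation by multiplying the attention matrix with the features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)           # W in Equation (4)
        self.H = nn.Parameter(torch.randn(2 * d_out) * 0.1)   # H in Equation (4)

    def forward(self, X, A):
        Z = self.W(X)                                    # (N, D')
        d = Z.size(1)
        # e_ij = H^T [Z_i || Z_j], computed for all pairs via broadcasting.
        e = (Z @ self.H[:d]).unsqueeze(1) + (Z @ self.H[d:]).unsqueeze(0)
        e = F.leaky_relu(e, negative_slope=0.2)
        e = e.masked_fill(A == 0, float("-inf"))         # keep only j in the neighborhood
        alpha = torch.softmax(e, dim=1)                  # Equation (5), row-normalized
        return alpha @ Z                                 # fuse topology and node features
```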
To synthesize the advantages of each graph attention convolution layer and obtain a comprehensive representation of the graph, we design the multi-layer-integration GAT structure shown in Figure 1. After multiple graph attention convolution layers, the hierarchical features of the same node are summarized as the new node features X_mGAT, which can be computed by Equation (7):

X_mGAT = ∑_{l=1}^{L} X_GAT^(l)   (7)

where X_GAT^(l) represents the output node features of the l-th graph attention convolution layer, and L is the total number of graph attention convolution layers.
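Reusing the GraphAttentionLayer sketch above, the multi-layer integration of Equation (7) can be written as a sum of each layer's output node features; all layers share the hidden width here so that the element-wise sum is well defined.

```python
# Sketch of Equation (7): run L graph attention convolution layers and sum
# their outputs element-wise to form X_mGAT.
# GraphAttentionLayer is the single-head sketch from the previous block.
import torch
import torch.nn as nn

class MultiLayerIntegrationGAT(nn.Module):
    def __init__(self, d_in, d_hidden, n_layers=2):
        super().__init__()
        dims = [d_in] + [d_hidden] * n_layers
        self.layers = nn.ModuleList(
            GraphAttentionLayer(dims[l], dims[l + 1]) for l in range(n_layers)
        )

    def forward(self, X, A):
        outputs = []
        for layer in self.layers:
            X = layer(X, A)                      # X_GAT^(l)
            outputs.append(X)
        return torch.stack(outputs).sum(dim=0)   # X_mGAT = sum over the L layers
```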

Graph Pooling Layer
For graph classification, we use a graph pooling layer to convert a graph of any size to a fixed-size output. Specifically, we adopt the differentiable pooling proposed in [58] to construct the graph pooling layer. The idea of differentiable pooling is to transform the original graph into a coarsened graph by learning a soft node assignment (embedding). Considering X_in ∈ R^(N×D) as the input node features and N' as the new number of nodes, the embedding matrix S ∈ R^(N×N') can be learned by Equation (8):

S = softmax(X_in W_emb + b_emb)   (8)

where W_emb ∈ R^(D×N') represents the learnable weights, b_emb is the bias, and the softmax function is applied row-wise. The output node feature matrix X_GP ∈ R^(N'×D) of a graph pooling layer can then be calculated by Equation (9):

X_GP = S^T X_in   (9)

Because the graph pooling operation is learnable, the output graph is an optimized, reduced-size representation of the input graph.
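A sketch of this pooling step (Equations (8)-(9)) follows; unlike full differentiable pooling [58], which also coarsens the adjacency matrix, only the feature pooling stated in the text is shown, since the pooled graph is flattened immediately afterwards.

```python
# Sketch of Equations (8)-(9): learn a row-wise soft assignment S that maps
# N input nodes onto N' output nodes, then pool the features as S^T X.
import torch
import torch.nn as nn

class GraphPooling(nn.Module):
    def __init__(self, d_in, n_out):
        super().__init__()
        self.embed = nn.Linear(d_in, n_out)        # W_emb and b_emb in Equation (8)

    def forward(self, X):
        S = torch.softmax(self.embed(X), dim=1)    # (N, N'), row-wise softmax
        return S.transpose(0, 1) @ X               # X_GP = S^T X, shape (N', D)
```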

Classification Layer
After graph pooling, we flatten the node feature matrix to obtain a finite-dimensional vector that serves as the global representation of the graph. Taking X_in as the input node features, the flatten operation can be represented by Equation (10):

x = flatten(X_in)   (10)

where x is a feature vector. At the end of the network, we add fully connected layers followed by the sigmoid activation function as the classifier to complete the graph classification. The classification probability output ŷ of the last fully connected layer can be computed by Equation (11):

ŷ = σ(W_fc x + b_fc)   (11)

where σ(·) is the sigmoid function, W_fc represents the learnable weights and b_fc is the bias. Furthermore, we apply the binary cross-entropy as the loss function, which can be defined by Equation (12):

L(Λ) = −∑_{c=1}^{C} [ y^(c) log ŷ^(c) + (1 − y^(c)) log(1 − ŷ^(c)) ]   (12)

where Λ represents the parameters of the whole GNN network and y^(c) indicates the ground-truth binary label of class c. Via back-propagation, Λ can be optimized based on the gradient of the loss; thus, the GNN completes the multi-label classification in an end-to-end manner. The training process of the whole MLRSSC-CNN-GNN framework is shown in Algorithm 2.
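The classification layer of Equations (10)-(12) then reduces to a flatten, two fully connected layers and a sigmoid; the ReLU between the two fully connected layers in the sketch below is our assumption, as the paper only specifies the final sigmoid.

```python
# Sketch of Equations (10)-(12): flatten the pooled node features, apply fully
# connected layers with a sigmoid output, and train with binary cross-entropy.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, n_nodes, d_in, n_classes=17, d_hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(n_nodes * d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, n_classes)

    def forward(self, X_gp):
        x = X_gp.flatten()                                        # Equation (10)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))  # Equation (11)

criterion = nn.BCELoss()   # Equation (12): binary cross-entropy over the C classes
```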

Experiments
In this section, we first describe the datasets, then present the evaluation metrics and the experimental settings, and finally report and analyze the experimental results.
Algorithm 2 Training process of the proposed MLRSSC-CNN-GNN framework
Input: RS images I and ground-truth multi-labels y in the training set.
Output: Model parameters Θ and Λ.
Step 1: Learning CNN
1: Take I and y as input, and train the CNN to optimize Θ according to Equation (1);
2: Extract the deep feature maps M of I according to Equation (2);
Step 2: Constructing scene graph
3: Construct the node feature matrix X and adjacency matrix A of I according to Algorithm 1;
Step 3: Learning GNN
4: for iter = 1, 2, . . . do
5:   Initialize the parameters Λ of the network in the first iteration;
6:   Update X using the L graph attention convolution layers according to Equations (4)-(6);
7:   Fuse X_GAT from the L graph attention convolution layers according to Equation (7);
8:   Convert X_mGAT to a fixed-size output via the graph pooling layer according to Equations (8)-(9);
9:   Flatten X_GP and generate the classification probability ŷ after the classification layer according to Equations (10)-(11);
10:  Calculate the loss based on the output ŷ of the network and y according to Equation (12);
11:  Update Λ by back-propagation;
12: end for
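Step 3 of Algorithm 2 condenses to a standard training loop; in the sketch below, model, graphs, labels and num_epochs are hypothetical names, while the Adagrad optimizer and initial learning rate match the experimental settings reported later.

```python
# Condensed sketch of Step 3 in Algorithm 2 (hypothetical `model`, `graphs`,
# `labels`, `num_epochs`): optimize the GNN parameters Lambda by back-propagation.
import torch

optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)   # [65]
criterion = torch.nn.BCELoss()

for epoch in range(num_epochs):
    for (X, A), y in zip(graphs, labels):    # one scene graph per training image
        y_hat = model(X, A)                  # GAT layers -> pooling -> classifier
        loss = criterion(y_hat, y)           # Equation (12)
        optimizer.zero_grad()
        loss.backward()                      # gradient of the loss w.r.t. Lambda
        optimizer.step()
```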


Dataset Description
We perform experiments on the UCM multi-label dataset and the AID multi-label dataset. The UCM multi-label dataset contains 2100 RS images with a spatial resolution of 0.3 m/pixel and an image size of 256 × 256 pixels. For MLRSSC, the dataset is annotated with the following 17 categories based on the DLRSD dataset [59]: airplane, bare soil, buildings, cars, chaparral, court, dock, field, grass, mobile home, pavement, sand, sea, ship, tanks, trees, and water. Some example images and their labels are shown in Figure 2. The AID multi-label dataset [48] contains 3000 RS images from the AID dataset [60]. For MLRSSC, the dataset is assigned the same 17 categories as the UCM multi-label dataset. The spatial resolutions of the images vary from 0.5 m/pixel to 0.8 m/pixel, and the size of each image is 600 × 600 pixels. Some example images and their labels are shown in Figure 3.
Figure 3. Samples in the AID multi-label dataset.

Evaluation Metrics
We calculate Precision, Recall, F1-Score and F2-Score to evaluate the multi-label classification performance [61]. The indicators are computed from the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) in an example (i.e., an image with multiple labels), using Equations (13) and (14):

Precision = TP / (TP + FP),   Recall = TP / (TP + FN)   (13)

Fβ-Score = (1 + β²) · Precision · Recall / (β² · Precision + Recall)   (14)

Note that all the evaluation indicators are example-based indices, formed by averaging the scores of each individual sample [62]. Generally, the F1-Score and F2-Score are relatively more important for performance evaluation.
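A sketch of these example-based metrics follows (NumPy assumed; the eps guard for empty label sets is our implementation choice).

```python
# Sketch of Equations (13)-(14): compute Precision, Recall and F-beta per
# example (image), then average over all samples [62].
import numpy as np

def example_based_scores(y_true, y_pred, beta=1.0, eps=1e-12):
    """y_true, y_pred: (n_samples, n_classes) binary arrays."""
    tp = (y_true * y_pred).sum(axis=1)
    precision = tp / np.maximum(y_pred.sum(axis=1), eps)
    recall = tp / np.maximum(y_true.sum(axis=1), eps)
    f_beta = (1 + beta**2) * precision * recall / np.maximum(
        beta**2 * precision + recall, eps)
    return precision.mean(), recall.mean(), f_beta.mean()

# F1-Score uses beta = 1; F2-Score uses beta = 2, weighting recall more heavily.
```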

Experimental Settings
In our experiments, we adopt VGG16 [63] as the CNN backbone. The network is initialized with weights trained on ImageNet [64] and fine-tuned on the experimental datasets. In addition, we fuse the feature maps from the "block4_conv3" and "block5_conv3" layers of VGG16 as the node features of the scene graph; thus, the total dimension of the initial node features is 1024.
Our recommended GNN architecture contains two graph attention convolution layers with output dimensions of 512 and multi-head attention with K = 3. The multi-layer-integration GAT structure is applied to these graph attention convolution layers. Subsequently, we set up one graph pooling layer that fixes the size of the graph to 32 nodes and two fully connected layers with output dimensions of 256 and 17 (the number of categories). Moreover, dropout is applied between layers, and batch normalization is employed for all layers but the last. The network is trained with the Adagrad optimizer [65], with a learning rate initially set to 0.01 that decays during training.
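Combining the sketches from Section 3, the reported architecture might be assembled as below; note that the attention layers here are single-head for brevity, whereas the paper uses K = 3 heads, and the dropout and batch-normalization placements are omitted.

```python
# Sketch assembling the reported GNN: two graph attention convolution layers of
# width 512 with multi-layer integration, one pooling layer to 32 nodes, and
# fully connected layers of 256 and 17 units (single-head layers for brevity).
import torch.nn as nn

class MLRSSC_GNN(nn.Module):
    def __init__(self, d_in=1024, d_hidden=512, n_pool=32, n_classes=17):
        super().__init__()
        self.gat = MultiLayerIntegrationGAT(d_in, d_hidden, n_layers=2)
        self.pool = GraphPooling(d_hidden, n_pool)
        self.head = ClassificationHead(n_pool, d_hidden, n_classes, d_hidden=256)

    def forward(self, X, A):
        X = self.gat(X, A)     # spatio-topological feature learning (Eqs. 4-7)
        X = self.pool(X)       # fixed-size graph of 32 nodes (Eqs. 8-9)
        return self.head(X)    # multi-label probabilities (Eqs. 10-11)
```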
To pursue a fair comparison, following the partition in [48], the UCM and AID multi-label datasets are split into 72% for training, 8% for validation and 20% for testing. Note that this partition is pre-set rather than random, and the training and testing samples have obvious style differences, making it more challenging for the classification methods. In the training phase, we only use the training images and their ground-truth labels to train the CNN and the GNN. Specifically, we train the CNN to extract deep feature maps of the images and then construct a scene graph for each image, which is the input of the GNN. In the testing phase, the testing images are fed into the trained CNN and GNN models to predict multi-labels.

Comparison with the State-of-the-Art Methods
We compare our proposed methods with several recent methods, including the standard CNN [63], CNN-RBFNN [33], CA-CNN-BiLSTM [34] and AL-RN-CNN [48]. For a fair comparison, all compared methods adopt the same VGG16 structure as the CNN backbone. We implement the standard CNN method as the baseline of MLRSSC and report the mean and standard deviation [66] of the evaluation results. Because the other methods adopt the same dataset partition, we take the evaluation results reported in their corresponding publications as the comparison reference in this paper. Note that the existing methods do not report the standard deviations of their evaluation results, and since their source code is not released, these standard deviations are hard to recover. Fortunately, we find that the variance across our repeated experiments is very slight, which helps to fully show the superiority of our proposed method. For the proposed methods, we report the results of the MLRSSC-CNN-GNN via the standard GAT and the MLRSSC-CNN-GNN via the multi-layer-integration GAT, respectively.

Results on the UCM Multi-Label Dataset
The quantitative results on the UCM multi-label dataset are shown in Table 1. We can observe that our proposed MLRSSC-CNN-GNN via the multi-layer-integration GAT achieves the highest scores for Recall, F1-Score and F2-Score; in general, it achieves the best performance, and even its lower bound exceeds the performance of the existing methods. We can also observe that our methods with the GNN show significant improvement over the method that only uses the CNN. Compared with the standard CNN, the proposed method gains improvements of 7.40% in F1-Score and 7.09% in F2-Score, which demonstrates that learning the spatial relationships of visual elements via the GNN plays an important role in advancing the classification performance. Moreover, the MLRSSC-CNN-GNN via the multi-layer-integration GAT performs better than the MLRSSC-CNN-GNN via the standard GAT, which shows the effectiveness of the proposed multi-layer-integration GAT. Some samples of the predicted results on the UCM multi-label dataset are exhibited in Figure 4. It can be seen that the proposed method successfully captures the main categories of each scene. However, our method is still insufficient in the details: predictions for cars, grass, and bare soil, for example, may be inconsistent with the ground truths.

Results on the AID Multi-Label Dataset
Table 2 shows the experimental results on the AID multi-label dataset. We can observe that our proposed MLRSSC-CNN-GNN via the multi-layer-integration GAT again achieves the best performance, with the highest scores for Recall, F1-Score and F2-Score. Compared to the standard CNN, the proposed method increases the F1-Score and F2-Score by 3.33% and 3.82%, respectively. Compared to AL-RN-CNN, it gains improvements of 0.55% in F1-Score and 0.87% in F2-Score. Compared to the MLRSSC-CNN-GNN via the standard GAT, it gains improvements of 0.32% in F1-Score and 0.52% in F2-Score.
Table 2. Performances of different methods on the AID multi-label dataset (%).
Some samples of the predicted results on the AID multi-label dataset are exhibited in Figure 5. Consistent with the results on the UCM multi-label dataset, our method successfully captures the main categories of each scene. The superior performance on both the UCM and AID multi-label datasets shows the robustness and effectiveness of our method.



Discussion
In this section, we analyze the influence of some important factors in the proposed framework, including the number of superpixel regions in the scene graph, the value K of multi-head attention in the GNN, and the depth of the GNN.

Effect on the Number of Superpixel Regions
When constructing the scene graph, the number of superpixel regions N is a vital parameter that determines the scale and granularity of the initial graph. Therefore, it is necessary to set an appropriate N. Considering the tradeoff between efficiency and performance, we vary N from 30 to 110 with a step size of 20 and study its effects. The results on the UCM and AID multi-label datasets are shown in Figure 6. It can be seen that when N is set between 50 and 90, our model achieves better performance.


Sensitivity Analysis of the Multi-Head Attention
In the graph attention convolution layer of the GNN, we adopt multi-head attention to stabilize the learning process. However, a larger value of K in multi-head attention increases the number of parameters and the computational cost of the model. Thus, we study the effects of K by setting it to values from 1 to 5. The experimental results on the UCM and AID multi-label datasets are shown in Figure 7. Clearly, the use of multi-head attention improves the classification performance because it can learn richer feature representations. It can be seen that when K reaches 3, the performance of the model begins to saturate; when K increases further, the model may face an overfitting problem.

Discussion on the Depth of GNN
The graph attention convolution layer is the key component of the GNN for learning the classification features of the graph. To explore its effect in our framework, we build the GNN with different numbers of graph attention convolution layers. Figure 8 shows the performance of our MLRSSC-CNN-GNN with one, two, and three graph attention convolution layers; the output dimension of each layer is 512, and the remaining structures of the GNN are the same. It can be seen that the MLRSSC-CNN-GNN with two graph attention convolution layers achieves the best performance, with the highest F1-Score and F2-Score. However, when the number of graph attention convolution layers reaches three, both the F1-Score and F2-Score begin to drop. A possible reason for the performance drop of the deeper GNN is that the node features become oversmoothed when a larger number of graph attention convolution layers is utilized.


Conclusions
MLRSSC remains a challenging task because it is difficult to learn discriminative semantic representations that distinguish multiple categories. Although many deep learning-based methods have been proposed to address MLRSSC and have achieved a certain degree of success, the existing methods are limited to perceiving visual elements in the scene while disregarding the spatial relationships of these visual elements. With this consideration, this paper proposes a novel MLRSSC-CNN-GNN framework to address MLRSSC. Different from the existing methods, the proposed method comprehensively utilizes the visual and spatial information in the scene by combining the CNN and the GNN. Specifically, we encode the visual content and spatial structure of the RS image scene by constructing a scene graph. The CNN and the GNN are used to mine the appearance features and the spatio-topological relationships, respectively. In addition, we design the multi-layer-integration GAT model to further mine the topological representations of the scene graph for classification. The proposed framework is verified on two public MLRSSC datasets. As the experimental results show, the proposed method improves both the F1-Score and F2-Score by more than 3%, which demonstrates the importance of learning spatio-topological relationships via the GNN. Moreover, the proposed method obtains superior performance compared with the state-of-the-art methods. As a general framework, the proposed MLRSSC-CNN-GNN is highly flexible; it can be easily and dynamically enhanced by replacing its modules with more advanced algorithms. In future work, we will consider adopting more advanced CNN and GNN models to explore the potential of our framework. In addition, our method does not yet explicitly model label dependency, which is also important in MLRSSC; we will focus on integrating this consideration into our method to further improve the performance.

Conflicts of Interest:
The authors declare no conflict of interest.