GSAP: A Global Structure Attention Pooling Method for Graph-Based Visual Place Recognition

The Visual Place Recognition problem aims to recognize a previously visited location from an image. In most revisited scenes, the appearance and viewpoint are drastically different. Most previous works focus on 2-D image-based deep learning methods. However, convolutional features are not robust enough in the challenging scenes mentioned above. In this paper, to take advantage of the information that helps the Visual Place Recognition task in these challenging scenes, we propose a new graph construction approach that extracts useful information from an RGB image and a depth image and fuses it in graph data. Then, we treat the Visual Place Recognition problem as a graph classification problem. We propose a new Global Pooling method, Global Structure Attention Pooling (GSAP), which improves the classification accuracy by improving the expression ability of the Global Pooling component. The experiments show that our GSAP method improves the accuracy of graph classification by approximately 2–5%, the graph construction method improves the accuracy of graph classification by approximately 4–6%, and that the whole Visual Place Recognition model is robust to appearance change and view change.


Introduction
With the development of robotics and computer vision in recent years, improvement in the accuracy of localization and mapping is urgently needed. Given a sequence of images captured from different places, the images of the same place should be found, which is the Visual Place Recognition (VPR) problem [1]. VPR is a key component of image-based Localization, Mapping, and Simultaneous Localization and Mapping (SLAM). Because VPR can help reduce the accumulative error in the applications mentioned above, it has attracted more attention in recent years. It is a challenging task for the following three reasons (the first two are shown in Figure 1):
• The viewpoint of the same place can change drastically when the place is revisited.
• The appearance can change due to illumination and seasonal change.
• Having a large number of images in the database causes high computational cost.
Most of the previous research focuses on the image processing of the VPR system. Early studies (e.g., Bag of Words (BoW)-based [2]) use traditional descriptors such as SIFT or SURF, which are not necessarily effective when the appearance changes. Artificial intelligence has made great progress in recent years, with most research focusing on deep learning methods [3][4][5]. Most VPR studies use convolutional features [6][7][8][9][10][11][12][13]. Chen et al. design a framework using convolutional features and untrainable pooling layers [8]. Arandjelovic et al. design a trainable pooling layer to improve the performance. However, convolutional features are not robust enough to appearance and viewpoint change [6]. In addition, semantic segmentation is introduced to construct a graph [14,15]. As the semantic graph expression is invariant in the first two challenging scenarios, Gawel et al. use a random walk graph kernel method to extract descriptors in the semantic graph [14]. However, random walk is a kind of graph kernel method, and good performance of graph kernel methods usually comes at the cost of heavy computational consumption. Furthermore, the semantic segmentation results are not fully utilized when constructing graph data in Gawel et al.'s study [14].
Figure 1. Subfigures (a-d) are taken from one place in different views, with different appearance (e.g., illumination, weather or seasonal change, or all), or both: Viewpoint changes between subfigures (a-d). Appearance changes between subfigures (a-d). Viewpoint and appearance both change between subfigures (a,d), which is the most challenging scene.
To solve the appearance change and viewpoint change problem, we try to extract more robust features. We first transform the image sequence into a graph sequence using the results of semantic segmentation and the depth image, i.e., we turn place recognition into a graph classification problem and fuse the semantic and geometric features that are robust to appearance and view changes. As the Graph Neural Network (GNN) has shown advantages in dealing with graph data [16], we use a GNN model to complete the classification task, and we propose a novel graph Global Pooling method to improve the classification accuracy. The training of the constructed graph data is more time-consuming than training on the original image data directly. Because the most commonly used graph Global Pooling is not injective, it cannot map distinct multisets of node features into unique embeddings [17]. This can lead to false positive results. We design a trainable Global Pooling layer to improve the expression ability, though the injective property is still not guaranteed.
The contributions of this paper are as follows.
• Based on the graph construction approach in X-View [14], we propose a graph data construction approach that transforms the RGB image and the depth image into a graph for place recognition by extracting both semantic and geometric information.
• To improve the expression ability of the GNN architecture, we propose a Global Pooling method, Global Structure Attention Pooling. Compared with the most commonly used global pooling methods, e.g., global mean pooling, global max pooling, and global sum pooling, our pooling method is trainable and improves the expression ability. Compared with other trainable pooling methods, our work not only captures the first-order neighbor information by learning attention scores, but also models the spatial relation with the Gated Recurrent Unit (GRU) for higher-order neighbor information.
The rest of the paper is organized as follows. Some related work and their similarities and differences with the VPR problem are presented in Section 2. The methodology of the proposed graph construction method and GSAP is presented in Section 3. Experiment results are shown in Section 4 and followed by discussions in Section 5. The conclusion is drawn in Section 6.

Related Work
In this section, we review the literature related to the components of our proposed VPR architecture, including VPR research works based on different kinds of features or information, image retrieval, and graph classification.

Visual Place Recognition
As cameras are the primary sensors of autonomous systems, visual place recognition has been attracting increasingly more attention, aiming to choose loop closure candidates for the SLAM algorithm. There are two main challenges in visual place recognition, based on the differences between scenes: one is the appearance change caused by illumination conditions and seasonal changes, and the other is the viewpoint change caused by revisiting one place from different viewpoints [1]. In the VPR literature, various feature extraction methods have been developed for visual place recognition, including deep convolutional feature-based methods [6][7][8][9][10][11][12][13], handcrafted feature-based methods [2,18], semantic information-based methods [19][20][21][22][23][24][25], sequence-based methods [26,27], and graph-based methods [19,20][28][29][30][31][32]. Overall, most of these studies focus on the image processing module of the visual place recognition system, which aims to extract and describe features that are robust in the challenging conditions mentioned above. Like these graph-based methods, our work concentrates on the extraction and representation of global features. However, the novelty lies in that we are the first to advance the VPR problem by improving the expression ability of Global Graph Pooling. Our model can recognize more complex patterns by learning structural information and property information alternately. Compared with the convolutional feature-based methods, our work is more robust in scenes with drastic view and appearance changes.

Image Retrieval
With the wide spread of the Internet and search engines, efficient and accurate image retrieval has greatly progressed in recent years, including class-level and instance-level image retrieval. Given an image of an instance, other images of this instance in the database should be found, which is the aim of instance-level image retrieval. If the given instance is an image of a certain place and the database contains a sequence of images w.r.t. the place, it becomes a VPR problem. Most VPR problems are considered retrieval problems as well. In the literature, various methods have been developed for image retrieval, including text-based methods, content-based methods [33][34][35][36][37][38], sketch-based methods [39][40][41], and semantic-based methods [36,41,42]. Overall, most of the studies focus on feature extraction and representation. Place recognition can be treated as a classification problem as well [8].

Graph Classification
Graph classification is a graph-level learning task, which concentrates on global information. A common architecture first fuses the local information and then fuses the global information. To fuse local information, Hierarchical Pooling is generally utilized, e.g., TopKPooling [50,51], SAGPooling [52], EdgePooling [53], ASAPooling [54], etc.
After the Hierarchical Pooling layer, a Global Pooling layer is designed to get the global embedding. Such mechanisms aggregate all the nodes at one time and produce a fixed-length global representation. A sum, mean, or max function is often utilized as the Global Pooling layer. However, the expression ability of such layers is limited, which means that different graph features may have the same representation and result in false classification. Structural information is completely lost when using these kinds of Global Pooling layers. To improve the diversity of global representations and the classification accuracy, some Global Pooling methods have been studied [55][56][57]. The Recurrent Neural Network (RNN) has attracted increasing attention in modeling sequence data [58,59]. In this paper, we use edges to compute the score for node features, get a feature sequence, and then extract the global representation with a GRU [60], which fuses the structural information and node information together in the global representation.
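The limitation of sum, mean, and max pooling described above can be illustrated with a small example. The following is a minimal sketch, not taken from the paper: the three toy graphs and their 2-D node features are hypothetical, and edges are deliberately left out because none of the three functions uses them.

```python
def global_sum(nodes):
    # Column-wise sum over all node feature vectors.
    return [sum(col) for col in zip(*nodes)]

def global_mean(nodes):
    # Column-wise mean over all node feature vectors.
    return [sum(col) / len(nodes) for col in zip(*nodes)]

def global_max(nodes):
    # Column-wise maximum over all node feature vectors.
    return [max(col) for col in zip(*nodes)]

# Hypothetical graphs, represented only by their multisets of node features.
graph_a = [[1.0, 0.0], [0.0, 1.0]]
graph_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # graph_a duplicated
graph_c = [[1.0, 1.0], [0.0, 0.0]]

mean_collides = global_mean(graph_a) == global_mean(graph_b)
max_collides = global_max(graph_a) == global_max(graph_b)
sum_collides = global_sum(graph_a) == global_sum(graph_c)
print(mean_collides, max_collides, sum_collides)
```

In this toy case, mean and max pooling cannot tell `graph_a` from `graph_b`, and sum pooling cannot tell `graph_a` from `graph_c`, which is the kind of collision that can lead to false classification.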

Methodology
In this section, we present the proposed graph-based VPR model. First, an overview of the VPR pipeline is given. Second, we present the proposed graph construction approach in detail. Finally, the details of the GSAP method are described, the process after Global Pooling is described briefly, and the loss function used in our approach is presented.

Overview of the VPR Pipeline
An overview of the proposed VPR pipeline is shown in Figure 2. We use semantic segmentation and depth image pairs to construct graph data, which is done off-line. The detailed process of graph construction is presented in Section 3.2. The graph data are fed into a GNN model to get the graph embedding. Then, the graph embedding is passed to a Multilayer Perceptron (MLP) to get the classification results. In our GNN model, we use GIN [17] as the Graph Convolution layer to aggregate the node features. After this, we apply Batch Normalization (BN) over a batch of node features [61]. Then, we use a Global Pooling (GP) layer to learn the global representation. We do not adopt Hierarchical Pooling, e.g., SAGPooling [52] or EdgePooling [53], because introducing it does not improve the performance of graph classification. Actually, in most GNN architectures, the convolutional layer can quickly lead to smooth node representations [62].
The Graph Convolution layer (GIN) updates node representations as follows:
$$x_i' = h_\Theta\Big((1 + \epsilon)\, x_i + \sum_{j \in \mathcal{N}(i)} x_j\Big)$$
Here, $h_\Theta$ denotes a neural network; we use an MLP in this paper. $\epsilon$ can be a parameter to be learned or a fixed scalar. $x_i$ is the node representation of the $i$-th node, $\mathcal{N}(i)$ is the set of nodes adjacent to $i$, $x_j$ is the representation of a neighbor node of $i$, and $x_i'$ is the node representation of node $i$ in the next layer.
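As an illustration of this update, the following minimal sketch assumes the standard GIN form, in which the node's own feature is scaled by (1 + ε), the neighbor features are added, and the result is passed through the network. The function `relu_map` is a hypothetical stand-in for the MLP, and the toy graph is invented for the example.

```python
def gin_update(x, neighbors, h_theta, eps=0.0):
    """One GIN-style update: h_theta((1 + eps) * x_i + sum of neighbor features)."""
    updated = []
    for i, x_i in enumerate(x):
        agg = [(1.0 + eps) * v for v in x_i]
        for j in neighbors.get(i, []):
            agg = [a + b for a, b in zip(agg, x[j])]
        updated.append(h_theta(agg))
    return updated

def relu_map(v):
    # Hypothetical stand-in for the MLP: a fixed linear map followed by ReLU.
    return [max(0.0, v[0] + v[1]), max(0.0, v[0] - v[1])]

# Toy graph with three nodes; node 0 is adjacent to nodes 1 and 2.
x = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
x_next = gin_update(x, neighbors, relu_map, eps=0.0)
print(x_next)
```

With ε fixed to 0, each node simply sums its own feature with those of its neighbors before the transformation; a learned ε would let the layer weight the node's own feature differently from its neighborhood.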
We propose a novel Global Pooling method, Global Structure Attention Pooling. The details are described in Section 3.3.

Graph Construction
Based on Gawel et al.'s graph construction approach [14], we propose a graph construction approach for the 3-D case. In our approach, the geometric and semantic information is utilized to preserve more information that contributes to the subsequent classification. We define a graph G = (V, E, X), where V, E, and X are the sets of nodes, edges, and node features, respectively. The workflow is listed in Algorithm 1. First, we get the semantic segmentation and depth image corresponding to the RGB image, in the same way as Gawel et al. [14]. Second, we extract semantic labels and blob attributes from the semantic segmentation result and depth image. Finally, the undirected graph is assembled as follows:

Algorithm 1 Graph Construction
Input: RGB image I, depth image D
Output: constructed graph G = (V, E, X)
1: compute semantic segmentation results S
2: extract blobs in S
3: blobs → V
4: for each blob do
5:   compute u, v, x, y, w, h, a
6:   find node label in S
7:   find the depth corresponding to (u, v) in D
8:   compute (X, Y, Z)
9:   onehot(label), X, Y, Z, x, y, w, h, a → x, x ∈ X
10: end for
11: for every two blobs b_1, b_2 do
12:   compute b_or by the bitwise OR of b_1 and b_2
13:   compute the number N of connected components of b_or
14:   compute d_e
15:   if d_e < d_t and N = 2 then
16:     edge connected, edge ∈ E
17:   else
18:     do nothing
19:   end if
20: end for

Nodes Determination and Node Labels
Every blob is regarded as a node. Every blob has a corresponding semantic label. We regard the semantic label as the node label.

Node Attributes
The node attributes include 8 elements: X, Y, Z, x, y, w, h, a, where (X, Y, Z) is the 3-D location expressed in the camera frame. The last 5 elements are blob attributes: (x, y) is the top-left corner coordinate of the external polygon of a blob, and w, h, a are the width, height, and area of the external polygon of a blob, respectively.
Given the semantic segmentation result, the last 5 elements can be computed by using OpenCV. The first three elements are derived as follows.
By using the depth image, every node has its corresponding depth and pixel location. We transform this information to the 3-D location in the camera frame with a camera projection model [63]:
$$Z = D(u, v), \qquad X = \frac{(u - c_x)\, Z}{f_x}, \qquad Y = \frac{(v - c_y)\, Z}{f_y}$$
where $(u, v)$ are the pixel coordinates, $D(u, v)$ is the depth at that pixel, and $f_x$, $f_y$, $c_x$, $c_y$ are the camera intrinsics.
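The transformation from pixel location and depth to a 3-D point can be sketched as follows, assuming the standard pinhole model in which the depth value is the Z coordinate along the optical axis. The intrinsic values in the example are hypothetical and not those of the datasets used in this paper.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with the given depth into the camera frame (pinhole model)."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z

# Hypothetical intrinsics: 500 px focal length, principal point at (320, 240).
point = back_project(u=420, v=240, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(point)
```

A pixel on the principal point maps to a point on the optical axis, and the lateral offset grows linearly with depth. Depth images stored as integers would first need to be converted to metric units.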

Edges Determination
The edges are determined by the blobs' proximity and their 3-D distances. The edges have no labels or attributes. We find the proximate blobs by applying a bitwise OR operation to every two blobs b_1 and b_2, which gives b_or. After that, we compute the number N of connected components of b_or. If N = 2, b_1 and b_2 are proximate blobs. To exclude the false neighbors caused by occlusion, we also consider the Euclidean distance d_e between (X_b1, Y_b1, Z_b1) and (X_b2, Y_b2, Z_b2), which are the locations of b_1 and b_2. If d_e is smaller than the threshold d_t, b_1 and b_2 are not false neighbors. Overall, if N = 2 and d_e < d_t, the two nodes are connected by an edge. Otherwise, there is no edge between them.
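The edge test can be sketched as follows. This is a minimal sketch under stated assumptions: blobs are given as binary masks of the same size, connectivity is 8-connected, and the component count includes one background label (as OpenCV's `connectedComponents` does), which is one reading under which two touching blobs give N = 2. The helper names and the toy masks are hypothetical.

```python
import math
from collections import deque

def count_labels(mask):
    """Count 8-connected foreground components of a binary mask, plus one background label."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return components + 1  # assumption: the background counts as one label

def has_edge(b1, b2, p1, p2, d_t):
    """Connect two blobs if their union is one region and their 3-D centers are close."""
    b_or = [[a | b for a, b in zip(row1, row2)] for row1, row2 in zip(b1, b2)]
    n = count_labels(b_or)
    d_e = math.dist(p1, p2)
    return n == 2 and d_e < d_t

# Toy 3x5 masks: b1 touches b2, while b3 is separated from b1 by a gap.
b1 = [[1, 1, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
b2 = [[0, 0, 1, 1, 0], [0, 0, 1, 1, 0], [0, 0, 0, 0, 0]]
b3 = [[0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [0, 0, 0, 0, 0]]

touching_and_close = has_edge(b1, b2, (0.0, 0.0, 2.0), (0.5, 0.0, 2.2), d_t=1.0)
touching_but_far = has_edge(b1, b2, (0.0, 0.0, 2.0), (0.5, 0.0, 9.0), d_t=1.0)
not_touching = has_edge(b1, b3, (0.0, 0.0, 2.0), (0.5, 0.0, 2.2), d_t=1.0)
print(touching_and_close, touching_but_far, not_touching)
```

The second case shows why the distance check matters: two blobs can be adjacent in the image while lying at very different depths, for example when one object occludes another.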

Node Features
We combine the node label and node attributes together as the node feature. Thus, the input node feature $x_i$ of our GNN architecture is
$$x_i = [\,\mathrm{onehot}(label),\ X,\ Y,\ Z,\ x,\ y,\ w,\ h,\ a\,]$$
where onehot(label) is the one-hot encoding of the node label of node $i$. In this way, every image is transformed into a graph.
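Assembling a node feature from its label and attributes can be sketched as below. The class count of 13 matches the number of semantic classes listed for the datasets in this paper; the label index and the attribute values are made up for illustration.

```python
def one_hot(label, num_classes):
    """Return a one-hot list with a 1.0 at the position of the given class index."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def node_feature(label, num_classes, xyz, blob):
    """Concatenate onehot(label), the 3-D location (X, Y, Z), and the blob attributes (x, y, w, h, a)."""
    X, Y, Z = xyz
    x, y, w, h, a = blob
    return one_hot(label, num_classes) + [X, Y, Z, x, y, w, h, a]

# Hypothetical blob: class index 2, located 2 m in front of the camera.
feature = node_feature(label=2, num_classes=13,
                       xyz=(0.4, 0.0, 2.0),
                       blob=(400.0, 200.0, 40.0, 80.0, 3200.0))
print(len(feature))
```

With 13 classes and 8 attributes, every node feature has 21 entries. In practice the attributes span very different numeric ranges, so some normalization before training may be useful; the paper does not specify this.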

Global Structure Attention Pooling
Expression ability has been widely researched in recent GNN-related studies [17]. We design a Global Structure Attention Pooling (GSAP) method to improve the expression ability of the Global Pooling layer and the performance of graph classification. The basic process of GSAP is shown in Figure 3 and the detailed process is as follows.
In general, the edges and nodes in one graph are related to each other. Considering that edges in graphs contain structural information, we compute a score for each edge from the features of its two nodes. We concatenate the two features and obtain the score of every edge with a single fully connected layer with a LeakyReLU activation function. The score of the edge between node $i$ and node $j$ is computed by the following equation:
$$s_{ij} = \mathrm{LeakyReLU}\big(w^{\top}(x_i \,\|\, x_j) + b\big), \qquad w \in \mathbb{R}^{2n}$$
where $\|$ represents concatenation, $w$ and $b$ are the parameters to be learned, and $n$ is the dimension of $x_i$ and $x_j$.
Then, the score of node $i$ is computed from the scores of the edges between node $i$ and each node $j \in \mathcal{N}(i)$, where $\mathcal{N}(i)$ is the set of nodes adjacent to $i$.
Figure 3. The Global Structure Attention Pooling (GSAP) process. For illustration, we assume that the graph has three nodes.
Given the node scores, the feature sequence fuses node information and structural information together. The node feature $x_i$ is weighted by its node score $s_i$. As graphs may have different numbers of nodes, the lengths of the feature sequences may differ. We append zero vectors at the end to keep the sequence at a fixed length. Then, we get the feature sequence $x_s$ as follows:
$$x_s = [\, s_1 x_1,\ s_2 x_2,\ \ldots,\ s_k x_k,\ \mathbf{0},\ \ldots,\ \mathbf{0} \,]$$
where $k$ is the number of nodes in the graph and $\mathbf{0}$ is the zero vector, whose dimension is $n$ as well. The number of zero vectors is an adjustable parameter $m$. To preserve all the information of a feature graph, $k$ and $m$ should satisfy the following condition:
$$k + m \geq k_{\max}$$
where $k_{\max}$ is the maximal number of nodes in all graphs. Even though we get the sequence $x_s$ by considering the first-order neighbors of each node, nodes without a direct connection to each other can also have a certain relation. GRU can model the information of several consecutive nodes with its reset gate and update gate. We utilize GRU to extract the feature graph global descriptor $x_g$:
$$x_g' = \mathrm{GRU}(x_s)$$
where $x_g' \in \mathbb{R}^{p \times (k+m)}$ is a padded concatenation of every GRU output. Finally, we reshape the matrix $x_g'$ into a vector $x_g$. Let $q = p \times (k + m)$; then $x_g \in \mathbb{R}^{q}$. In our GNNs, we concatenate the three feature graph global descriptors $x_{g1}$, $x_{g2}$, and $x_{g3}$. Thus, the graph global descriptor for classification is
$$x_{gd} = x_{g1} \,\|\, x_{g2} \,\|\, x_{g3}$$
$x_{gd} \in \mathbb{R}^{3q}$ is passed into the MLP layer, followed by a Log Softmax function that generates a probability distribution over all classes and computes its logarithm for numerical stability. We then compute the NLL loss to optimize the parameters of the network.
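The pooling steps described in this section can be sketched end to end in plain Python. This is a rough sketch under several assumptions that the text does not fix: the node score is taken as the sum of the scores of its incident edges, nodes are kept in their given order, the GRU uses the standard recurrence without bias terms, and all weights are random placeholders rather than trained parameters. All function names are hypothetical.

```python
import math
import random

def leaky_relu(v, slope=0.01):
    return v if v > 0 else slope * v

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def node_scores(x, edges, w, b):
    # Edge score: LeakyReLU(w . [x_i || x_j] + b). Node score: sum over incident
    # edges (the summation is an assumption of this sketch).
    s = [0.0] * len(x)
    for i, j in edges:
        e_ij = leaky_relu(sum(wk * vk for wk, vk in zip(w, x[i] + x[j])) + b)
        s[i] += e_ij
        s[j] += e_ij
    return s

def feature_sequence(x, s, length):
    # Weight each node feature by its score, then pad with zero vectors.
    if length < len(x):
        raise ValueError("sequence length must cover every node")
    seq = [[s_i * v for v in x_i] for s_i, x_i in zip(s, x)]
    return seq + [[0.0] * len(x[0]) for _ in range(length - len(x))]

def gru_outputs(seq, params, p):
    # Standard GRU recurrence without bias terms; returns the hidden state per step.
    h, outputs = [0.0] * p, []
    for x_t in seq:
        r = [sigmoid(a + c) for a, c in zip(matvec(params["W_r"], x_t), matvec(params["U_r"], h))]
        z = [sigmoid(a + c) for a, c in zip(matvec(params["W_z"], x_t), matvec(params["U_z"], h))]
        cand = [math.tanh(a + r_k * c)
                for a, r_k, c in zip(matvec(params["W_n"], x_t), r, matvec(params["U_n"], h))]
        h = [(1.0 - z_k) * c_k + z_k * h_k for z_k, c_k, h_k in zip(z, cand, h)]
        outputs.append(h)
    return outputs

def gsap(x, edges, w, b, params, p, length):
    s = node_scores(x, edges, w, b)
    outputs = gru_outputs(feature_sequence(x, s, length), params, p)
    return [v for h in outputs for v in h]  # flattened, length p * length

rng = random.Random(0)
n, p, length = 2, 3, 5  # feature size, GRU hidden size, padded sequence length
shapes = {"W_r": n, "U_r": p, "W_z": n, "U_z": p, "W_n": n, "U_n": p}
params = {name: [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(p)]
          for name, cols in shapes.items()}
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three nodes, as in Figure 3
edges = [(0, 1), (1, 2)]
w, b = [0.5, -0.5, 0.25, 0.25], 0.1
scores = node_scores(x, edges, w, b)
x_g = gsap(x, edges, w, b, params, p, length)
print(len(x_g))
```

The descriptor has a fixed length of p × (k + m) regardless of how many nodes the graph has, which is what allows it to be fed to an MLP. Because the concatenation order of the two node features affects the edge score, a real implementation would need a consistent convention for undirected edges.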

Datasets
We use two original datasets to prepare their corresponding graph datasets: Airsim dataset [14] and SYNTHIA dataset [64], in which RGB images, depth images, semantic segmentation labels, and odometry information are provided.

Airsim
This is a simulated dataset made by a photo-realistic Airsim framework [14]. Images of top-down view and forward-facing view (as shown in Figure 4) are collected by an Unmanned Aerial Vehicle and a car. Each of these view sequences contains over 5000 images with associated ground truth. Here, the semantic classes are misc, street, building, car, sign, fence, hedge, tree, wall, bench, power line, rock, and pool. For each view sequence, useful information is provided, such as ground truth for semantic segmentation, instance segmentation, global camera poses, depth images, and calibration parameters. The viewpoint change between each top-down and forward-facing view pair is 90°. The classes are balanced as the camera motion is uniform.

Figure 4. Forward view and downward view.
We use the waypoint files to sample images at constant intervals of around 10 m, each of which contains 50 image frames. We deal with the forward-facing trajectory in the same way. Ignoring the offset in the z direction, the forward view and its corresponding downward view are given the same label when constructing the graph label file. Every class has 100 graphs in total. We construct the node- and edge-level information by the approach described in Section 3.2. Some statistical data of our Airsim graph dataset are shown in Table 1. The number of nodes reflects the richness of the semantic information. The number of edges reflects the complexity of the graph structure. Whether the classes are balanced affects the choice of evaluation metrics.

SYNTHIA
Different from the Airsim dataset, each subsequence in the SYNTHIA dataset consists of the same traffic situation but under different weather, illumination, and season conditions, and the classes are imbalanced as the camera motions are not uniform. The current subsequences are Spring, Summer, Fall, Winter, Rain, Soft-rain, Sunset, Fog, Night, and Dawn. Each of these subsequences contains approximately 8000 images with associated ground truth. For each subsequence, useful information is provided, such as 8 views, ground truth for semantic segmentation, instance segmentation, global camera poses, depth images, and calibration parameters. Here, the semantic classes are misc, sky, building, road, side walk, fence, vegetation, pole, car, sign, pedestrian, cyclist, and lane-marking.
We use global camera locations to label image frames every 20 m. As shown in Figure 5, we choose the forward and leftward views to make sure the viewpoint change is 90°, and we choose four kinds of appearances, i.e., Dawn, Night, Summer, and Fog. The key steps of node and edge construction are the same as for Airsim. The statistical data of the SYNTHIA graph datasets are shown in Table 1 as well. The same places in different subsequences have different appearances and views.
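One way to assign place labels from camera locations is to accumulate the distance travelled and start a new class every fixed interval. The sketch below assumes this cumulative-distance rule, which the paper does not spell out, and uses a made-up straight trajectory.

```python
import math

def label_by_distance(positions, interval):
    """Assign a class index to each frame from the cumulative distance travelled."""
    labels, travelled = [], 0.0
    for k, pos in enumerate(positions):
        if k > 0:
            travelled += math.dist(positions[k - 1], pos)
        labels.append(int(travelled // interval))
    return labels

# Hypothetical trajectory: one frame every 5 m along a straight line.
positions = [(5.0 * k, 0.0) for k in range(10)]
labels = label_by_distance(positions, interval=20.0)
print(labels)
```

Under this rule, non-uniform camera motion produces classes with different numbers of frames, which is consistent with the class imbalance noted for SYNTHIA.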

Evaluation Metrics
For the graph classification task, we utilize various evaluation metrics from previous classification work. For the experiments using the Airsim dataset, we aim to measure the contribution of the key components in our GNN model. We use Accuracy (Accuracy = Correct Prediction Number / Total Sample Number) as the evaluation metric, as the Airsim dataset has balanced classes. We use Training Accuracy to measure the expression ability of the GP methods and use Test Accuracy to measure the generalization ability.
For the experiments using the SYNTHIA dataset, we aim to compare the performance of our VPR model with other VPR methods. We remove the MLP after training, compute the embeddings of all the graphs in test set, and then compute the Euclidean distances of every two embeddings. We show the performances via the Precision-Recall Curve as the class in SYNTHIA dataset is imbalanced.
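The evaluation by embedding distance can be sketched as follows. This is a minimal sketch, assuming that a pair of graphs is predicted as the same place when its Euclidean distance is at most a threshold, and that the curve is obtained by sweeping that threshold. The embeddings and labels are toy values.

```python
import math

def precision_recall(embeddings, labels, thresholds):
    """Sweep a distance threshold over all embedding pairs and report (t, precision, recall)."""
    pairs = [(math.dist(embeddings[i], embeddings[j]), labels[i] == labels[j])
             for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))]
    total_pos = sum(1 for _, same in pairs if same)
    curve = []
    for t in thresholds:
        tp = sum(1 for d, same in pairs if d <= t and same)
        fp = sum(1 for d, same in pairs if d <= t and not same)
        precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is retrieved
        recall = tp / total_pos if total_pos else 0.0
        curve.append((t, precision, recall))
    return curve

# Toy 2-D embeddings of five graphs from three places.
embeddings = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.2, 0.0), (0.5, 0.0)]
labels = [0, 0, 1, 1, 2]
curve = precision_recall(embeddings, labels, thresholds=[0.15, 0.25, 0.45])
for t, prec, rec in curve:
    print(t, round(prec, 3), round(rec, 3))
```

Raising the threshold increases recall but eventually admits pairs from different places, which lowers precision. This trade-off is what the Precision-Recall Curve summarizes, and it remains informative when classes are imbalanced.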

Task Setting
For the experiments in Section 5, the data of the two Airsim sequences are mixed together and divided into three parts with a ratio of 6:2:2, namely, the training set, validation set, and test set.
For the experiments in Section 4.5, the forward sequence and downward sequence are divided into three parts with a ratio of 6:2:2, respectively. Here, we use the same random seed of the random split function for different sequences to make sure the train, validation, and test sets have no geographical overlap. The first and second parts of the downward sequence are considered as the training set and validation set, respectively. The third part of the forward sequence is considered as the test set.
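The effect of reusing the random seed can be sketched as follows, assuming that both sequences are indexed in the same place order, so that equal indices refer to the same location. The function name, the seed value, and the sample count are hypothetical.

```python
import random

def split_indices(num_samples, seed, ratios=(6, 2, 2)):
    """Shuffle indices with a fixed seed and split them into train/validation/test parts."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    total = sum(ratios)
    n_train = num_samples * ratios[0] // total
    n_val = num_samples * ratios[1] // total
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

# The same seed yields the same partition for both sequences.
down_train, down_val, down_test = split_indices(50, seed=42)
fwd_train, fwd_val, fwd_test = split_indices(50, seed=42)

same_partition = (down_train, down_val, down_test) == (fwd_train, fwd_val, fwd_test)
no_overlap = set(down_train).isdisjoint(fwd_test) and set(down_val).isdisjoint(fwd_test)
print(same_partition, no_overlap)
```

Because the partitions coincide, training on the first part of the downward sequence and testing on the third part of the forward sequence never uses the same location in both sets.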
As for the SYNTHIA dataset, the four subsequences are divided in the same way as mentioned above for the Airsim dataset. The first and second parts of the "Dawn" subsequence are considered as the training set and validation set, respectively. The third part of the "Night", "Summer", and "Fog" subsequences is considered as the test set, respectively.
We use AdapNet [65] as a semantic segmentation net. First, we train AdapNet [65] on the training set. Second, we use AdapNet [65] and Algorithm 1 to obtain the graph data of all the datasets. Third, we train the GNN model by using the graph data. Finally, the classification test results can be obtained.

Methods to Compare
The hyperparameters for our model and all the compared models are searched in a certain range, as shown in Table 2.

Table 2. The hyperparameters of our model and compared models. The hidden size is only for Graph Neural Network (GNN) models. The 0 value of Weight decay is used for the experiments on the training set to show the expression ability of the GNN models. The number of nodes reserved is only for the model with GSAP.
Hyperparameter              Range
Learning rate               1 × 10^−4, 5 × 10^−3, 1 × 10^−3, 1 × 10^−2
Hidden size                 128, 256, 512
Weight decay                0, 1 × 10^−2, 1 × 10^−3, 1 × 10^−4
Number of nodes reserved    50-100

We consider the following VPR methods for comparison with our VPR model on the SYNTHIA dataset.
• NetVLAD [7]: This is a deep learning method that combines the Vector of Locally Aggregated Descriptors (VLAD) with Convolutional Neural Networks. For a fair comparison, we use an MLP to classify after getting the global descriptor, and the training and test processes are also the same as for our VPR method.
• AMOSNet [8]: This is also a deep learning method using a 2-D image. It uses a convolution kernel, and the feature map is fed into two fully connected layers.
• DBoW2 [2]: It extracts handcrafted features, generates the dictionary by clustering these features, and looks for the corresponding words of a query image in the dictionary.
We consider the following Global Pooling methods for comparison with our GSAP method on the Airsim dataset.
• Set2Set [55]: Based on iterative content-based attention, it has the property that the vector retrieved from memory does not change if the memory is randomly shuffled, so its output is invariant to the order of the input set.
The information abstracted in the graph construction step has an effect on the subsequent task. We conduct experiments on the Airsim dataset to compare the different kinds of graph construction approaches:
• Semantic Information based [14]: It extracts semantic labels and instance-center 3-D locations as graph features.
• Geometric Information based: We evaluate the effectiveness of the geometric information separately. It extracts the instance-center 3-D location; the top-left corner coordinate of the external polygon of a blob; and the width, height, and area of the external polygon of a blob as graph features.

Results and Analysis
We present the results of all the compared VPR methods on the SYNTHIA dataset in Figure 6. First, our VPR model performs the best among all the compared VPR models; it maintains a relatively high Precision Rate as the Recall Rate becomes larger. The Precision Rate becomes very low when the Recall Rate is close to 1. On one hand, this could be because the overlap between two adjacent frames is relatively high, i.e., they are on the class boundary. As the data we use are image sequences, the gap between two proximate frames with different classes is too small to distinguish. On the other hand, some images of the same class with different views do not overlap with each other, which is another possible reason. Second, the 2-D image deep learning-based method NetVLAD does not perform better than our VPR model. This could be because it cannot utilize the 3-D information. Besides, it relies on the scene appearance and viewpoint. In our experiments, the data in the test set have very different appearances and views compared with the training set. NetVLAD is not robust to these kinds of changes, which leads to poor generalization to different scenes. Third, AMOSNet performs worse than NetVLAD. It uses similar convolutional layers for feature map generation. Different from AMOSNet, NetVLAD has a trainable pooling layer to learn the cluster centers of local features (as shown in Table 3), which leads to the generation of more effective image representations. Finally, DBoW2 gives a worse performance as the handcrafted feature is not robust enough compared with the graph feature and deep convolutional feature.
Figure 6. The Precision-Recall Curve comparison of the VPR models in subsequences "Night", "Summer", and "Fog" of SYNTHIA dataset.
We present the results of all the compared VPR methods on the Airsim dataset as well.
As shown in Figure 7, as on the SYNTHIA dataset, our VPR model performs the best among all the compared VPR models. However, AMOSNet performs better than NetVLAD on the Airsim dataset, which means that AMOSNet is better at recognizing scenes with viewpoint change. The trainable pooling layer of NetVLAD can improve the expression ability, but strong expression ability may sometimes lead to weak generalization ability. Thus, expression ability and generalization ability need to be balanced. Summarizing all the results, we can see that our GNN-based model is competitive for solving the VPR task on both the SYNTHIA and Airsim datasets, especially when the view and appearance change drastically.

Effect of Structure Attention Pooling Method
We conduct experiments on the Airsim dataset to compare the performances of Global Pooling methods. Global Pooling is a necessary component in graph classification. In our GNN model, we propose GSAP to get the global descriptor. We first examine the effect of GSAP by comparing it with other Global Pooling methods in terms of classification accuracy. The expression ability can be measured on the training dataset. Figure 8 shows that GSAP reaches the highest accuracy in the fewest epochs and that its curve is the most stable, without sudden changes. GDP, GMP, and Set2Set have similar performance. GAP performs the worst. As for generalization ability, the upper half of Table 4 shows that our GSAP method achieves the best accuracy among these methods. Structural information can help improve the classification accuracy.
Figure 8. The Training Accuracy results. The right half is a partial enlargement of the left half. This figure reflects the expression ability of the compared global pooling methods. The higher the training accuracy they achieve, the stronger expression ability they have.

Effect of Graph Construction
We conduct experiments on Airsim to compare and evaluate the effectiveness of the graph construction approaches. By using the different construction approaches shown in Section 4.4, we get different graph data. Results in the lower half of Table 4 show that a relatively high accuracy is achieved when relying on geometric information alone, but the distribution of values is relatively dispersed. Semantic information contributes to a more stable performance, but the accuracy values are relatively low. Our graph construction approach integrates and preserves more effective information for the graph classification task, which leads to the best performance. The main difficulty when applying it is that the graph construction would be time-consuming if computing resources are limited.

Conclusions
In this paper, we transform the VPR problem into a graph classification task, and then we use a GNN model to solve the VPR problem. In our data preparation task, we propose a graph construction approach that extracts core information for the classification task. In our GNN model, we design a Global Pooling method by transforming graph features into a sequence and predicting the global representation with a GRU. We conduct extensive experiments in scenes with different appearances and views to verify the effectiveness and robustness of our VPR model. We can conclude that improving the expression ability of GNNs can contribute to the graph classification performance. The proposed method outperforms the state-of-the-art VPR algorithms in terms of the Precision rate and Recall rate. The limitation is that the graph construction would be time-consuming if computing resources are limited.
In our current work, the graph construction is done off-line. The whole architecture is not end to end. In future work, we will consider improving the edge construction approach by link prediction to make the constructed data more suitable for the subsequent classification task and so that the whole network can be trained end to end.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.