Representing Spatial Data with Graph Contrastive Learning

Large-scale geospatial data pave the way for geospatial machine learning algorithms, and the effectiveness of a machine learning model depends heavily on how its input data are represented. Hence, learning effective feature representations for geospatial data is a critical task. In this paper, we construct a spatial graph from locations and propose a geospatial graph contrastive learning method to learn location representations. First, we propose a skeleton graph that preserves the primary structure of the geospatial graph, in order to mitigate the positioning bias introduced by remote sensing. Then, we define a novel mixed node centrality measure and propose four data augmentation methods based on it. Finally, we propose a heterogeneous graph attention network that aggregates information from the structural neighborhood and the semantic neighborhood separately. Extensive experiments on both geospatial and non-geospatial datasets demonstrate that the proposed method outperforms state-of-the-art baselines.


Introduction
Geospatial data play an important role in many real-world problems, such as population migration prediction, intelligent transportation systems and automated driving. Recently, machine learning algorithms [1][2][3][4][5] have been successfully used in various fields such as health, finance, travel, computer vision and natural language processing. Large-scale geospatial data pave the way for geospatial machine learning algorithms. Data representation engineering is the foundation of machine learning, and a good representation is the key to learning an effective machine learning model [6]. Therefore, it is critical to learn effective representations for geospatial data. A geospatial graph is usually adopted to capture the complex relationships between different geo-locations in real-world scenarios. Each node in the geospatial graph is further associated with node features or other types of attributes, which contain rich semantic information. In this paper, we focus on the representation of the nodes in geospatial graphs. Specifically, each node is represented by a low-dimensional vector that carries meaningful semantic and structural information. These geospatial representations can be used to improve the accuracy of machine learning models and enable rich downstream tasks.
Traditional unsupervised graph representation learning approaches such as DeepWalk [7] and node2vec [8] rely excessively on the proximity information defined by the network structure. Recently, contrastive learning has seen a renewed surge of interest [9][10][11][12][13][14]. Contrastive learning aims to learn representations by maximizing feature consistency across differently augmented views. When combined with graph neural networks, contrastive learning can potentially overcome the aforementioned limitations of proximity-based approaches. Hence, we propose to learn geospatial graph representations based on contrastive learning.
However, geospatial graph representation is a daunting task due to the following challenges: (1) Positioning devices are not accurate. Positioning bias from remote sensing can introduce erroneous information into the geospatial data. As shown in Figure 1a, assuming shops A and B are located in the same building, very close to each other, remote sensing-based tools may incorrectly record a user's visit to A as a visit to B. Other factors, such as human errors in data preparation, may also introduce errors into the geospatial data. (2) It is difficult to learn meaningful representations for nodes that are sparsely connected. Most existing contrastive learning models directly aggregate structural neighborhood features. As shown in Figure 1b, a sparsely connected node A can only aggregate information from one neighbor node. The performance of contrastive learning models can be severely affected by such nodes. To solve this problem, Wang et al. [15] extract embeddings from node features, topological structures and their combinations simultaneously, and then use an attention mechanism to aggregate the three embeddings. Wei et al. [16] construct a KNN graph from the attribute features of nodes and use it to enhance the node embeddings. However, these methods ignore the heterogeneity of the edges. (3) Traditional data augmentation techniques may break critical information in geospatial graphs. The good performance of a contrastive learning method depends on a reasonable data augmentation technique. Most existing data augmentation techniques [9,10,12] increase the size of the training data by randomly perturbing nodes and edges in the original graph, which is likely to break the connectivity and structural features of the original graph. As can be seen from Figure 1c,d, deleting important nodes may cause the graph to lose a large number of edges, and removing important edges may break the original graph into several independent sub-graphs.

In this paper, we propose a novel geospatial graph representation model, namely Semantic Enhanced-Graph Contrastive Learning (SE-GCL). To address the first challenge, we propose to generate a skeleton graph from the original graph. The skeleton graph focuses on the primary structure of the original geospatial graph and ignores the fine-grained details, so that the errors introduced by positioning devices are disregarded. To address the second challenge, we build a semantic geospatial graph by injecting semantic edges into the original geospatial graph. A semantic edge captures the similarity between the associated features or attributes of a node pair. With the injected semantic edges, the semantic geospatial graph is denser. We further propose a heterogeneous graph attention network (HGAT) that aggregates information from both the original edges and the injected semantic edges. Finally, to address the last challenge, we define a novel mixed node centrality measure and propose four data augmentation methods based on it. The proposed data augmentation methods preserve the important information in geospatial graphs. The main contributions of this paper are summarized as follows:

•
To address the incorrect information introduced by positioning devices, we propose to generate a skeleton graph from the original graph. The skeleton graph preserves the primary structure of the geospatial graph while ignoring the fine-grained details, thereby disregarding the errors introduced by positioning devices.

•
We inject semantic edges to capture the similarity between the associated features and attributes of node pairs. We propose HGAT to aggregate information from both structural and semantic neighborhoods. The incorporated semantic information enables sparsely connected nodes to learn meaningful representations.

•
We propose four novel data augmentation methods based on node centrality measures.
Compared with the random perturbation methods, the proposed data augmentation methods can better preserve the important information in the geospatial graph.

•
We conduct experiments on two real-world geospatial datasets. The experiments demonstrate that the proposed method significantly outperforms state-of-the-art methods in multiple downstream tasks. In addition, we conduct experiments on several non-geospatial datasets; the results show that the model is effective in both node classification and graph classification. Together, these results show that the proposed method generalizes well and can be extended to other applications.

Geospatial Data Prediction
Statistical models used for geospatial data prediction include recursive decomposition [17] and naïve Bayes [18]. These approaches rely on several assumptions; however, spatial data have become much more complex nowadays and no longer satisfy them. Since deep learning has brought about breakthroughs in many domains, more and more researchers apply deep learning to geospatial data prediction. JLGE [19] combines point-of-interest recommendation with graph embedding. It jointly learns the embeddings of six graphs, including two single-partite graphs (user-user and POI-POI) and four bipartite graphs (user-location, user-time, location-user and location-time). LBSNE [20] formalizes metapath-based random walks on LBSNs to construct heterogeneous neighborhoods of nodes, and then uses the learned heterogeneous neighborhood sequences to build a heterogeneous skip-gram model for network embedding. SE-KGE [21] encodes spatial information such as point coordinates or bounding boxes of geographic entities into a knowledge graph embedding space to handle different types of spatial inference; it then constructs a geographic knowledge graph and a set of geographic query-answer pairs. VirHpoi [22] introduces hypergraphs into heterogeneous embeddings to support point-of-interest recommendation services.

Graph Representation
Graph representation models aim to convert input graph data into low-dimensional vector representations. These representations benefit downstream tasks such as node classification and graph classification. Classical models include GCN [23], GAT [24] and GraphSAGE [25]. GCN [23] transfers graph-domain convolution into the frequency domain for node embedding, based on the Laplacian matrix and the Fourier transform. GAT [24] introduces an attention mechanism to adaptively assign different weights to different nodes. GraphSAGE [25] performs inductive node embedding based on a sub-graph sampling strategy. These models only focus on a node's structural neighborhood and ignore the rich semantic information. To capture semantic information, UGCN [26] introduces multi-type convolution to jointly extract information from the one-hop, two-hop and semantic neighbors of the target node. Similarly, SimP-GCN [16] constructs a KNN graph based on the similarity between attributes and fuses it with the adjacency matrix of the original graph. AM-GCN [15] extracts a topological embedding, a semantic feature embedding and their common embedding with GCNs, and combines them through an attention mechanism. However, these methods only treat semantic information as a supplement to structural information and ignore the heterogeneity between semantic and structural information.

Graph Contrastive Learning
Graph contrastive learning is one of the most widely used unsupervised graph representation learning methods. DGI [27] first introduces Deep InfoMax into graph learning and achieves satisfying results by maximizing the mutual information between local structure and global context. InfoGraph [13] improves DGI by concatenating the representations of different layers. GRACE [10] proposes a node-level graph contrastive learning method: the representations of the same node are pulled closer across the two views, while the representations of different nodes are pushed apart. BGRL [12] adopts a negative-sample-free method that maximizes the mutual information between an online encoder and a target encoder. GBT [14] uses an identity matrix to approximate the cross-correlation matrix, decorrelating the feature dimensions and reducing redundant information.
Data augmentation is one of the most important components of graph contrastive learning. Its purpose is to create novel yet plausible data through transformations. Most existing techniques [9,10] achieve data augmentation by randomly perturbing edges, nodes and attributes; these models ignore the differences in importance between nodes and edges in the graph. To solve this problem, GCA [11] proposes an adaptive data augmentation method. It identifies important nodes in the graph by calculating node and edge centrality measures, and then perturbs unimportant nodes and edges with a higher probability based on this centrality information. GROC [28] proposes a rule-based method to modify edges. LG2AR [29] proposes a data augmentation method based on the distribution of all nodes in the graph. Another group of methods augments data by sub-graph sampling. MH-Aug [30] studies graph augmentation based on Markov chain Monte Carlo sampling. MVGRL [28] generates augmented sub-graphs based on graph diffusion. However, these methods ignore semantic information when calculating the importance of nodes and edges in graphs.

Problem Formulation
In this section, we first present some basic definitions and then formulate the problem.

Definition 1 (Geospatial graph).
A geospatial graph is denoted as G = (V, E), where each node v_i ∈ V is a geographical location, identified by its latitude and longitude tuple (x_i, y_i). Given a threshold ω, for nodes v_i and v_j, if the distance between v_i and v_j is smaller than ω, there is an edge e_ij ∈ E between v_i and v_j. Each geospatial graph is associated with a feature matrix F ∈ R^{N×M}, where M is the feature dimension and F_i ∈ R^M is the feature vector of v_i.
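To make Definition 1 concrete, the following sketch builds the edge set E from a list of coordinates. It assumes planar (projected) coordinates, so plain Euclidean distance stands in for true geographic distance; the function name and representation are illustrative, not from the paper.

```python
from itertools import combinations
import math

def build_geospatial_graph(coords, omega):
    """Connect every pair of locations closer than the threshold omega.

    coords: list of (x, y) tuples, assumed projected to a planar CRS so
    Euclidean distance is a reasonable stand-in for geographic distance.
    Returns the edge set E as a set of index pairs (i, j) with i < j.
    """
    edges = set()
    for (i, (xi, yi)), (j, (xj, yj)) in combinations(enumerate(coords), 2):
        if math.hypot(xi - xj, yi - yj) < omega:
            edges.add((i, j))
    return edges
```

In practice one would use a spatial index (e.g., a grid or k-d tree) instead of the quadratic pair scan, but the thresholding logic is the same.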
In real-world applications, graphs are sometimes not associated with additional feature matrices. For those graphs, we define the feature vector of node v_i as its coordinate vector (x_i, y_i).

Definition 2 (User Activities Set). A user activity is a tuple (o, v) that indicates user o visited location v. A user activities set D_o is the set of activity tuples associated with user o. The total activities set D = {D_o | o ∈ O} includes the activity sets of all users.
Problem Statement. Given a geospatial graph G = (V, E) and the users' activities, our goal is to learn a representation matrix W ∈ R^{|V|×d} whose i-th row is a d-dimensional vector representing the location v_i ∈ V. The learned representation matrix W can be used as features for downstream tasks such as location classification.

Preliminary: Graph Contrastive Learning
In this subsection, we introduce more details about graph contrastive learning. The main idea behind contrastive learning is to generate two views, namely G_1 = τ_1(G) and G_2 = τ_2(G), from the input graph G by data augmentation functions τ_1 and τ_2, and to maximize the mutual information between the encoded representations of G_1 and G_2. The corresponding objective function can be defined as max_θ MI(δ(G_1), δ(G_2)), where δ(·) is a graph neural network that encodes graphs into node representations, θ represents the parameters of δ(·) and MI(·, ·) is a function that calculates the mutual information between δ(G_1) and δ(G_2).

Semantic Enhanced-Graph Contrastive Learning (SE-GCL)
In this section, we first present two novel definitions, namely the semantic geospatial graph and the skeleton graph. The semantic geospatial graph addresses the challenge of sparsely connected nodes by injecting semantic edges. The skeleton graph enables us to overlook the errors introduced by positioning devices and to focus on the primary structure of the geospatial graph. Next, we propose a graph contrastive learning method that learns representations for both the semantic geospatial graph and the skeleton graph. Finally, we aggregate the two representations to obtain the final representation.

Data Preparation
The geospatial graph does not utilize the rich semantic information in the features and in the user activity set. To incorporate such semantic information, we construct a semantic geospatial graph as follows.
Definition 3 (Semantic Relationship). Given two nodes v_i and v_j from the geospatial graph, we say v_i is semantically related to v_j if either (1) the similarity between their features is larger than a threshold γ, i.e., cosine(F_i, F_j) > γ, or (2) there exists a user o that has visited both v_i and v_j, i.e., (o, v_i) ∈ D_o and (o, v_j) ∈ D_o.

With the semantic relationship defined, we are ready to introduce the semantic geospatial graph.

Definition 4 (Semantic Geospatial Graph). Given a geospatial graph G = (V, E), a semantic geospatial graph is denoted as G_s = (V, E, E_s), where V and E are the same sets of nodes and edges as in G, and E_s is the set of semantic edges: two nodes v_i and v_j are connected by a semantic edge, i.e., (v_i, v_j) ∈ E_s, if v_i and v_j are semantically related. We refer to E in G_s as structural edges and to E_s as semantic edges.
Compared with a geospatial graph, a semantic geospatial graph is injected with many semantic edges.As a result, a less connected node in the geospatial graph is likely to be connected to other nodes through semantic edges, which addresses the challenge of sparse connection.
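The semantic edge construction above (feature similarity over a threshold γ, or co-visitation by the same user) can be sketched as follows; the function names and data layout are illustrative assumptions, not the paper's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_edges(features, activities, gamma):
    """Build the semantic edge set E_s.

    features: {node: feature vector}
    activities: {user: set of visited nodes} (the user activity sets D_o)
    gamma: feature-similarity threshold
    """
    nodes = sorted(features)
    edges = set()
    # (1) feature-similarity edges: cosine(F_i, F_j) > gamma
    for idx, vi in enumerate(nodes):
        for vj in nodes[idx + 1:]:
            if cosine(features[vi], features[vj]) > gamma:
                edges.add((vi, vj))
    # (2) co-visitation edges: some user visited both endpoints
    for visited in activities.values():
        vs = sorted(visited)
        for a_i, vi in enumerate(vs):
            for vj in vs[a_i + 1:]:
                edges.add((vi, vj))
    return edges
```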
As positioning devices may introduce errors, we next propose the skeleton graph to only preserve the primary structure of the graph and disregard the fine-grained details.

Definition 5 (Skeleton graph). Given a semantic geospatial graph G_s = (V, E, E_s), a skeleton graph is denoted as G_p = (U, E_p, E_p^s), where each node u_i ∈ U corresponds to a cluster V_i ⊆ V in which the distance between each node pair v_m ∈ V_i and v_n ∈ V_i is less than a given threshold ω_c, E_p is the set of structural edges and E_p^s is the set of semantic edges. Two nodes u_i and u_j are connected by a structural (semantic) edge if there exist v_i ∈ V_i and v_j ∈ V_j such that v_i and v_j are connected by a structural (semantic) edge.

To construct the skeleton graph from the original geospatial graph efficiently, we adopt the following strategy. Firstly, we impose a grid on the space, where the side length of each cell is ω_c/√2; this guarantees that the distance between any two nodes in a cell is no larger than ω_c. Secondly, for each cell, we merge the set V_i of nodes inside it into a new node u_i of the skeleton graph. Finally, given two nodes u_i and u_j in the skeleton graph, let V_i and V_j be the corresponding node sets of u_i and u_j, respectively. If there exist v_m ∈ V_i and v_n ∈ V_j such that (v_m, v_n) is an edge in the original geospatial graph, we add an edge between u_i and u_j in the skeleton graph. The skeleton graph is then successfully constructed.

Figure 2 shows an example of the semantic geospatial graph and the skeleton graph. Given a geospatial graph G = (V, E), we inject the semantic edges (red dashed lines) into G to generate a semantic geospatial graph G_s = (V, E, E_s) (Figure 2a). To construct the skeleton graph, we group the nodes in Figure 2a into clusters and merge each cluster into a single skeleton node.

In summary, the semantic geospatial graph addresses the challenge of sparsely connected nodes by injecting semantic relations. The skeleton graph enables us to overlook the errors introduced by positioning devices and to focus on the primary structure of the geospatial graph.
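The grid-based construction can be sketched as below. Note an assumption: the sketch uses a cell side of ω_c/√2 so that the cell diagonal is exactly ω_c, which is what guarantees every within-cell distance is at most ω_c; the data layout and names are illustrative.

```python
import math

def build_skeleton(coords, edges, omega_c):
    """Merge nodes that fall into the same grid cell into skeleton nodes.

    coords: {node: (x, y)}; edges: iterable of (u, v) node pairs.
    Cell side omega_c / sqrt(2) keeps every within-cell distance <= omega_c.
    Returns (cluster_of, skeleton_edges), where cluster_of maps each original
    node to its grid-cell id and skeleton_edges connects distinct cells that
    contain at least one original edge between them.
    """
    side = omega_c / math.sqrt(2)
    cluster_of = {v: (int(x // side), int(y // side))
                  for v, (x, y) in coords.items()}
    skeleton_edges = set()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:  # edges inside a cell are absorbed by the merge
            skeleton_edges.add(tuple(sorted((cu, cv))))
    return cluster_of, skeleton_edges
```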

Solution Overview
The high-level idea of our solution is as follows. For a given geospatial graph, we construct a semantic geospatial graph G_s and a skeleton graph G_p. Then, we use a graph contrastive learning method named SE-GCL to learn the representation matrices W_s ∈ R^{|V|×d} and W_p ∈ R^{|U|×d} for G_s and G_p, respectively. Once the two representations are learned, we aggregate them to obtain the final node representation.
The framework of the graph contrastive learning model SE-GCL is shown in Figure 3. For a given graph (either the semantic geospatial graph or the skeleton graph), SE-GCL generates two views by data augmentation functions. After the views are generated, we apply HGAT to capture their structural and semantic information. The outputs of HGAT are passed to a multi-layer perceptron (MLP) network to generate node representations. We use W_s[i] to denote the representation of node v_i in the semantic geospatial graph, and W_p[j] to denote the representation of node u_j in the skeleton graph. For a node v_i in the geospatial graph, its final representation is obtained by aggregating W_s[i] and W_p[j], where u_j is the skeleton graph node whose corresponding cluster V_j ⊆ V contains v_i.
The learning process is described on the right side of Figure 3. For each node in the views, SE-GCL aims to bring the positive samples closer and push the negative samples away. In this paper, we regard the semantic neighbors of a node in the same view and the same node in the other view as positive samples; all other nodes are regarded as negative samples.
In the remainder of this section, we introduce the graph contrastive learning method by elaborating on data augmentation (Section 4.3), HGAT (Section 4.4) and the learning process (Section 4.5) in turn. As the learning processes of the semantic geospatial graph and the skeleton graph are independent, for ease of illustration we abuse the notation G = (V, E, E_s) to denote both graphs, where V is the set of nodes, E is the set of structural edges and E_s is the set of semantic edges.

Data Augmentation
Randomly perturbing nodes and edges in the graph may sabotage its critical information. Therefore, we propose to augment data by considering the importance of each node. A natural way to measure a node's importance is to calculate its centrality [11,28,29]. However, most existing centrality measures focus on homogeneous graphs. Since there are two types of edges in the semantic geospatial graph, it is inappropriate to adopt existing centrality measures directly. To address this problem, we define a novel mixed node centrality measure and propose four data augmentation methods based on it.

Mixed Centrality Measure
To design a good mixed centrality measure, we propose three semantic-aware measures, namely D-ClusterRank, D-DIL and D-CC. We elaborate on these measures in turn.

D-ClusterRank measure. ClusterRank [31] is a centrality measure based on the local aggregation coefficient:

CR_i = f(c_i) · Σ_{v_j ∈ N_i} (deg_j^out + 1),   (2)

where c_i represents the aggregation coefficient of the target node v_i, deg_j^out represents the out-degree of the neighbor v_j and N_i represents the neighborhood of v_i. f(c_i) = 10^{−c_i} is a nonlinear negative correlation function. Equation (3) depicts the computation of the aggregation coefficient c_i for node v_i:

c_i = R_i / (deg_i(deg_i − 1)/2),   (3)

where R_i represents the number of triangles formed with neighbors, deg_i represents the degree of v_i and deg_i(deg_i − 1)/2 is the maximum number of triangles among v_i's neighbors, attained in a complete graph.
ClusterRank uses degree centrality to measure the influence of each node, which treats every neighbor equally. However, different nodes in the graph have different significance, and we also need to consider two types of edges. Hence, it is inappropriate to use ClusterRank directly. To tackle this problem, we improve ClusterRank as follows. A well-known approach for capturing the significance of different nodes is PageRank [32], in which the rank of each node is the probability of a random walk reaching that node. To distinguish semantic and structural edges, we propose the following measure:

DPR_j = (1 − d)/N + d · Σ_{v_k ∈ N_j} w^t_{j,k} · DPR_k / deg_k,   (4)

where w^t_{j,k} is the number of edge types connecting v_j and v_k, d is the damping factor, N is the total number of nodes, (1 − d)/N is the probability of the random walk teleporting to each node and deg_k is the degree of node v_k.
Equation (4) evaluates the significance of nodes while taking into account the difference between semantic and structural edges. We next present the improved D-ClusterRank. Specifically, we replace the node centrality measure in Equation (2) with the significance measure in Equation (4):

DCR_i = f(c_i) · Σ_{v_j ∈ N_i^strc} DPR_j,   (5)

where N_i^strc represents the structural neighborhood.
D-DIL measure. DIL [33] suggests that nodes connected to important edges have a high probability of being important themselves. It computes the weighted sum of a node's degree and the importance of all its connected edges:

DIL_i = deg_i + Σ_{v_j ∈ N_i} I_{e_ij} · (deg_i − 1)/(deg_i + deg_j − 2),   (6)

where deg_i is the degree of node v_i, (deg_i + deg_j − 2) appears in the weight of the edge importance and I_{e_ij} is the importance of edge e_ij, defined as:

I_{e_ij} = (deg_i − p − 1)(deg_j − p − 1)/λ,   (7)

where p represents the number of triangles that the edge e_ij participates in and λ represents the weight coefficient. The term (deg_i − p − 1)(deg_j − p − 1) reflects the connectivity of edge e_ij: the more triangles e_ij forms, the less important e_ij is. Similar to ClusterRank, DIL does not distinguish semantic and structural edges, which is inappropriate for handling the semantic geospatial graph and the skeleton graph. To tackle this problem, we propose D-DIL, which considers both types of edges; in its computation, deg_i^1 and deg_i^2 are the numbers of structural and semantic edges connected to v_i, w^t_{j,k} is the number of edges between v_j and v_k, and p is the number of triangles formed by edges of the same type as e_ij.

D-CC measure. Closeness Centrality (CC) [34] measures the average shortest distance from a node to every other node:

CC_i = (N − 1) / Σ_{j ≠ i} dist_ij,   (8)

where dist_ij is the shortest distance between node v_i and node v_j. Note that to compute the shortest distance, each edge on the shortest path has unit weight. As discussed for D-ClusterRank and D-DIL, the CC measure considers neither the types of edges nor the importance of each edge, making it inappropriate for our problem. To address this, we assign each edge a weight w(e_ij) = 1/t^t_{i,j}, where t^t_{i,j} is the total number of edges of all types between v_i and v_j. Intuitively, if two nodes are connected by both structural and semantic edges, the connection contributes more strongly to the shortest distance. We then define D-CC as:

DCC_i = (N − 1) / Σ_{j ≠ i} w-dist_ij,   (9)

where w-dist_ij is the weighted shortest distance between v_i and v_j.
Mixed centrality measure. We are now ready to present the mixed centrality measure. We first combine the three measures above into a single score C_i for each node v_i. To guarantee that the value of C_i falls into the range [0, 1], we normalize it as

C̃_i = (C_i − C_min)/(C_max − C_min),

where C_min and C_max are the minimum and maximum values of C_i, respectively. The mixed centrality measure of node v_i is then computed by

MCM_i = σ(C̃_i / β),

where σ(·) is the sigmoid function and β is a temperature parameter that adjusts the distribution.
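The normalization step can be sketched as follows. Note the assumptions: how the three raw measures are combined into a single score is not specified here (the sketch takes a precombined score as input), and the sigmoid is applied to the normalized score divided by the temperature β.

```python
import math

def mixed_centrality(scores, beta=1.0):
    """Min-max normalize raw combined centrality scores to [0, 1], then
    squash with a temperature-scaled sigmoid.

    scores: {node: raw combined score} (how D-ClusterRank, D-DIL and D-CC
    are combined into this score is an assumption left to the caller).
    """
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {v: sigmoid(((s - lo) / span) / beta) for v, s in scores.items()}
```

Because the sigmoid maps into (0, 1), the resulting MCM values can be used directly as Bernoulli parameters in the augmentation methods below.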

Augmentation Methods
The mixed centrality measure evaluates the significance of each node in the graph. Based on it, we propose four data augmentation methods that preserve important information in the graph: the Enhanced Ripple Random Walker (E-RRW), Centrality-aware Node Perturbation (C-NP), Centrality-aware Feature Masking (C-FM) and Centrality-aware Edge Perturbation (C-EP). We elaborate on the four methods in turn.

Enhanced Ripple Random Walker (E-RRW).
The Ripple Random Walker (RRW) [35] is a sub-graph sampling method. It solves the problems of neighbor explosion and node dependence in random walks, and further reduces resource occupation and computing cost. Motivated by these advantages, we propose a novel data augmentation method, namely E-RRW. Specifically, we select the initial starting node based on the mixed centrality measure, and then generate augmented views by constructing sub-graphs with RRW.
Figure 4 shows the procedure of the E-RRW method. E-RRW generates two augmented sub-graphs from the original graph as follows. First, E-RRW selects the node with the largest mixed centrality as the initial node of the first sub-graph, denoted by v^1_init. E-RRW then collects v^1_init's k-hop neighborhood N^k_init and calculates a score for each node in it based on its mixed centrality MCM_j and a constant b. The node v_j with the largest score is selected as the initial node v^2_init of the second sub-graph. Starting from v^1_init, E-RRW randomly samples a fraction µ of nodes from the unselected neighbors of the selected nodes, where the expansion ratio 0 ≤ µ ≤ 1 is the proportion of nodes sampled from the neighbors. When µ is close to 0, the ripple random walk acts like random sampling; when µ is close to 1, it acts like breadth-first search. We repeat the sampling process until the number of nodes in each sub-graph reaches a predefined threshold. The detailed process of E-RRW is shown in Algorithm 1. First, E-RRW selects the initial nodes for the two sub-graphs (lines 1-2). Starting from the initial nodes, E-RRW expands the node sets of the two sub-graphs by RRW sampling (lines 13-18) from the original graph G (lines 3-4 and 7-12). Finally, E-RRW constructs the sub-graphs based on the extracted nodes (lines 5-6).
E-RRW has the following advantages: (1) it preserves important nodes in the graph after sampling; (2) with a size constraint, it generates small-scale sub-graphs, which greatly reduces the memory and computation burden during training; and (3) it ensures that the two generated sub-graphs are similar to each other, making the learning model easier to optimize.
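The core ripple-expansion loop described above can be sketched as follows; this is a simplified sketch of the sampling step only (initial-node scoring and the two-sub-graph coupling are omitted), and all names are illustrative.

```python
import random

def ripple_walk(adj, init, mu, max_nodes, rng=None):
    """Ripple random walk sampling: repeatedly select a mu-fraction of the
    frontier (unselected neighbors of already-selected nodes) until
    max_nodes nodes are chosen.

    adj: {node: set of neighbors}; nodes must be sortable (used for a
    deterministic sampling order).  mu near 0 behaves like random sampling,
    mu near 1 like breadth-first search.
    """
    rng = rng or random.Random(0)
    selected = {init}
    while len(selected) < max_nodes:
        frontier = set().union(*(adj[v] for v in selected)) - selected
        if not frontier:
            break  # connected component exhausted
        k = max(1, int(mu * len(frontier)))
        selected.update(rng.sample(sorted(frontier), min(k, len(frontier))))
    return selected
```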

Centrality-aware node perturbation (C-NP).
As shown in Figure 5a, the C-NP augmentation method deletes a fraction of nodes in the input graph based on the nodes' mixed centrality. As nodes with a higher mixed centrality are more important, we retain such nodes with a higher probability. Formally, we define a perturbing vector that follows the Bernoulli distribution Perturb[i] ∼ Bern(MCM_i), i.e., Prob(Perturb[i] = 1) = MCM_i. The C-NP augmentation method then deletes node v_i when Perturb[i] = 0, i.e., with probability 1 − MCM_i.
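C-NP reduces to one Bernoulli draw per node; a minimal sketch (names are illustrative):

```python
import random

def centrality_node_perturbation(nodes, mcm, rng=None):
    """Keep node v with probability MCM_v (a Bernoulli draw per node),
    delete it otherwise, so high-centrality nodes survive more often."""
    rng = rng or random.Random(0)
    return [v for v in nodes if rng.random() < mcm[v]]
```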

Centrality-aware feature masking (C-FM).
As shown in Figure 5b, the C-FM augmentation method masks a fraction of the dimensions of the node features with zeros. We assume that the features of nodes with a large mixed centrality are important, and define the masking probability of features based on the mixed centrality. Formally, we sample a random matrix M_fm ∈ R^{N×M}, where M is the feature dimension and N is the number of nodes. Each element of M_fm is drawn from a Bernoulli distribution, i.e., M_fm[i, j] ∼ Bern(MCM_i). The C-FM augmentation method masks the feature matrix by F̃ = F ∘ M_fm, where ∘ represents the element-wise (Hadamard) product.
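The element-wise masking can be sketched as below (plain Python lists stand in for the feature matrix; names are illustrative):

```python
import random

def centrality_feature_masking(F, mcm, rng=None):
    """Element-wise mask: entry (i, j) of the feature matrix survives with
    probability MCM_i and is zeroed otherwise, so features of important
    nodes are masked less often.  F is a list of per-node feature rows."""
    rng = rng or random.Random(0)
    return [[x if rng.random() < mcm[i] else 0.0 for x in row]
            for i, row in enumerate(F)]
```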

Centrality-aware edge perturbation (C-EP).
As shown in Figure 5c, the C-EP augmentation method adds or removes some edges in the graph. C-EP perturbs the edges in two steps: (1) each edge e_ij is deleted with a probability drawn from a Bernoulli distribution parameterized by the mixed centralities MCM_i and MCM_j of its endpoints, so that edges between important nodes are retained with a higher probability; (2) for each pair of unconnected nodes v_i and v_j, an edge (v_i, v_j) is added with a probability likewise parameterized by MCM_i and MCM_j.
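The deletion step can be sketched as follows. Note an assumption: the exact Bernoulli parameter is elided in the text, so this sketch keeps each edge with probability equal to the mean of its endpoints' mixed centralities.

```python
import random

def centrality_edge_perturbation(edges, mcm, rng=None):
    """Drop each edge (i, j) with probability 1 - mean(MCM_i, MCM_j), so
    edges between high-centrality endpoints are kept more often.  The
    choice of the mean as the Bernoulli parameter is an assumption."""
    rng = rng or random.Random(0)
    kept = []
    for i, j in edges:
        keep_p = (mcm[i] + mcm[j]) / 2.0
        if rng.random() < keep_p:
            kept.append((i, j))
    return kept
```

The edge-addition step is symmetric: iterate over unconnected pairs and add an edge with a probability derived the same way.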

Heterogeneous Graph Attention Network (HGAT)
We have presented data augmentation methods that generate sub-graphs as views for contrastive learning. In this subsection, we propose HGAT to capture extensive structural and semantic information from the generated views. Heterogeneous graphs are composed of different types of nodes and edges, whose features differ in type and dimensionality. Compared with general heterogeneous graphs, the semantic geospatial graph and the skeleton graph are special cases with their own properties: each contains only one type of node, and we only need to aggregate direct neighbors connected through different types of edges. Compared with traditional heterogeneous graph attention networks [36], HGAT is therefore a lightweight model with fewer parameters and faster training.
Before presenting HGAT, we first introduce semantic edge feature vectors. The feature vector of a semantic edge e_ij is a two-dimensional vector h_{e_ij} ∈ R^2. Recall that a semantic edge connects two nodes v_i and v_j if either they share similar features, i.e., cosine(F_i, F_j) > γ, or they have been visited by the same user o, i.e., (o, v_i) ∈ D_o and (o, v_j) ∈ D_o. If v_i and v_j share similar features, the first dimension of h_{e_ij} is the similarity between their features, i.e., h_{e_ij}[0] = cosine(F_i, F_j); otherwise, the first dimension is the number of users that have visited both v_i and v_j. In both cases, the second dimension is the shortest distance between v_i and v_j in the original geospatial graph.
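The two-dimensional semantic edge feature can be sketched as follows; the shortest-path distance is assumed to be precomputed and passed in, and the function name is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0-safe for zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def semantic_edge_feature(fi, fj, covisits, dist, gamma):
    """2-d feature h_e of a semantic edge (v_i, v_j):
    dim 0: feature similarity if it exceeds gamma, otherwise the number of
           users who visited both endpoints;
    dim 1: shortest-path distance between v_i and v_j in the original
           geospatial graph (precomputed by the caller)."""
    sim = cosine(fi, fj)
    first = sim if sim > gamma else float(covisits)
    return [first, float(dist)]
```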
As shown in Figure 6, HGAT calculates the representation of each node by aggregating information from structural neighbors, semantic neighbors and the corresponding semantic edges as follows:

h_i^(l+1) = σ( W_β ∑_{j ∈ N_i^stru} α_ij^r h_j^(l) + W_γ ∑_{k ∈ N_i^sema} α_ik^m (h_k^(l) || h_e_ik) ),

where ∑_{j ∈ N_i^stru} α_ij^r h_j^(l) is the weighted sum of the structural neighbor representations of the target node v_i at the l-th layer, ∑_{k ∈ N_i^sema} α_ik^m (h_k^(l) || h_e_ik) is the aggregation of the semantic neighbor representations and the corresponding semantic edge representations at the l-th layer, h_j^(l) is the representation of v_j at the l-th layer, h_e_ik is the feature vector of the semantic edge e_ik, || denotes concatenation, σ(·) is the activation function, and W_β and W_γ are parameters to be learned. The weight α_ij^r of a structural neighbor and the weight α_ik^m of a semantic neighbor and its semantic edge are computed by attention and normalized by softmax over N_i^stru and N_i^sema, respectively, where N_i^stru and N_i^sema are the structural and semantic neighbors of node v_i and W denotes the attention parameters to be learned.
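The separate aggregation of structural and semantic neighborhoods can be sketched as a single node update. To keep the sketch self-contained, the learned weight matrices are taken as the identity, tanh stands in for σ(·), and `a_r`/`a_m` are illustrative attention vectors; none of these choices are the paper's:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hgat_node_update(i, h, stru_nbrs, sema_nbrs, edge_feat, a_r, a_m):
    """One HGAT-style update for node i (weight matrices omitted)."""
    dim = len(h[i])

    # Structural attention: score neighbor j by a_r . [h_i || h_j],
    # normalize with softmax, then take the weighted sum of h_j.
    stru = stru_nbrs.get(i, [])
    if stru:
        scores = [sum(a * x for a, x in zip(a_r, h[i] + h[j])) for j in stru]
        alpha_r = softmax(scores)
        agg_r = [sum(w * h[j][d] for w, j in zip(alpha_r, stru)) for d in range(dim)]
    else:
        agg_r = [0.0] * dim

    # Semantic attention: score neighbor k by a_m . [h_i || h_k || h_e_ik],
    # so the semantic edge feature participates in the attention score.
    sema = sema_nbrs.get(i, [])
    if sema:
        scores = [sum(a * x for a, x in zip(a_m, h[i] + h[k] + edge_feat[(i, k)]))
                  for k in sema]
        alpha_m = softmax(scores)
        agg_m = [sum(w * h[k][d] for w, k in zip(alpha_m, sema)) for d in range(dim)]
    else:
        agg_m = [0.0] * dim

    # sigma(W_beta . agg_r + W_gamma . agg_m), with both matrices set to
    # the identity and tanh as the activation in this sketch.
    return [math.tanh(r + m) for r, m in zip(agg_r, agg_m)]
```

Keeping two attention mechanisms with separate parameters is what lets the model weight structural and semantic evidence independently.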
Finally, the outputs of HGAT are encoded by an MLP layer, i.e., z_i = σ(W^(2) σ(W^(1) h_i)), where σ(·) is the activation function and W^(1) and W^(2) are parameters to be learned. Please note that SE-GCL generates two views from the input graph by data augmentation and then maximizes the mutual information between the encoded representations (i.e., the outputs of the MLP layer) of the two views. Let z_i and N(v_i) denote the encoded representation and the semantic neighbors of node v_i in one view, and z'_i and N'(v_i) denote the encoded representation and the semantic neighbors of v_i in the other view. Given a representation z_i, its positive examples include z'_i, {z_j | v_j ∈ N(v_i)} and {z'_k | v_k ∈ N'(v_i)}. The negative samples consist of two parts: the nodes other than v_i and its semantic neighbors in the same view, denoted by M(v_i), and those in the other view, denoted by M'(v_i). The objective function for a positive example z'_j in the other view is defined as follows:

ℓ_inter(z_i, z'_j) = -log( e^{θ(z_i, z'_j)/η} / ( e^{θ(z_i, z'_j)/η} + ∑_{v_j ∈ M(v_i)} e^{θ(z_i, z_j)/η} + ∑_{v_k ∈ M'(v_i)} e^{θ(z_i, z'_k)/η} ) ),

where e^{θ(z_i, z'_j)/η} is the similarity between the representations of the same (or semantically related) node in different views, ∑_{v_j ∈ M(v_i)} e^{θ(z_i, z_j)/η} represents the similarity between v_i and its negative examples in the same view, ∑_{v_k ∈ M'(v_i)} e^{θ(z_i, z'_k)/η} represents the similarity between v_i and its negative examples in the other view, θ(·) is the cosine similarity function and η is the temperature parameter. The objective function for a positive example z_j in the same view is calculated analogously:

ℓ_intra(z_i, z_j) = -log( e^{θ(z_i, z_j)/η} / ( e^{θ(z_i, z_j)/η} + ∑_{v_j ∈ M(v_i)} e^{θ(z_i, z_j)/η} + ∑_{v_k ∈ M'(v_i)} e^{θ(z_i, z'_k)/η} ) ),

where e^{θ(z_i, z_j)/η} represents the similarity between the representations of semantically related nodes in the same view. The objective function of a view is the sum of ℓ_inter and ℓ_intra over all nodes and their positive examples, and the total objective function of SE-GCL is the sum of the objective functions of both views. Given a semantic geospatial graph or skeleton graph, SE-GCL generates two views by applying E-RRW, C-NP, C-FM and C-EP in turn. The views are then fed into the HGAT network, and the outputs of HGAT are passed to a multi-layer perceptron (MLP) network to generate the final representations of nodes. Finally, the contrastive training process continually adjusts the parameters to pull positive sample pairs closer together while pushing negative sample pairs apart.
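The semantic-neighbor-aware objective described above can be sketched for a single node as follows. This is an InfoNCE-style loss where the positives are the same node in the other view plus the semantic neighbors in both views, and the negatives are all remaining nodes; it is a sketch of the objective, not the paper's exact formulation:

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_infonce(i, z, z2, sema_nbrs, eta=0.5):
    """Contrastive loss for node i. z and z2 are the representations of
    the two views; sema_nbrs maps a node to its semantic neighbors."""
    nodes = range(len(z))
    nbrs = set(sema_nbrs.get(i, []))

    # Positives: the same node in the other view, plus the semantic
    # neighbors of v_i in both views.
    pos = [cos_sim(z[i], z2[i])]
    pos += [cos_sim(z[i], z[j]) for j in nbrs]
    pos += [cos_sim(z[i], z2[j]) for j in nbrs]

    # Negatives: every node that is neither v_i nor one of its semantic
    # neighbors, in the same view (M) and the other view (M').
    neg_same = sum(math.exp(cos_sim(z[i], z[j]) / eta)
                   for j in nodes if j != i and j not in nbrs)
    neg_diff = sum(math.exp(cos_sim(z[i], z2[j]) / eta)
                   for j in nodes if j != i and j not in nbrs)

    loss = 0.0
    for p in pos:
        ep = math.exp(p / eta)
        loss -= math.log(ep / (ep + neg_same + neg_diff))
    return loss / len(pos)
```

Minimizing this loss raises the similarity of each positive pair relative to the negatives, which is exactly the pull-together/push-apart behavior described above.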

Experiments
In this section, we conduct extensive experiments on three real-world geospatial datasets to demonstrate the effectiveness of our model. We begin with a brief introduction of the experimental setup, and then we present experimental results in comparison with the state-of-the-art baselines. After that, we perform ablation experiments to verify the validity of the modules in our model. Finally, we conduct extensive experiments on several non-geospatial datasets to evaluate the scalability of SE-GCL.

Datasets
We conduct experiments on three datasets including Gowalla, Brightkite and Nanjing POI.
Gowalla and Brightkite. Gowalla and Brightkite [37] are social network datasets based on geolocation information, mainly composed of users' check-in records and geolocation information. We extract the largest connected sub-graph from the original graph in our experimental study.
Nanjing POI. The Nanjing POI dataset covers seven types of geographical location points: catering, public facilities, companies, medical treatment, accommodation, government and transportation facilities. Each record in the dataset consists of the following six parts: geolocation point type, longitude, latitude, province, city and street. The graph contains a total of 12,004 position nodes, 4,004,346 structural edges and 59,953 semantic edges.
The details of the datasets are reported in Table 1.

Experiment Setup
Evaluation tasks.In this experimental study, we consider three tasks to evaluate the effectiveness of our proposed approach.
• Node classification. In the Nanjing POI dataset, each node in the geospatial graph corresponds to a POI in real life. Each node is associated with a category attribute, which indicates the type of POI. The node classification task classifies the nodes based on their category. To evaluate the quality of the results, we employ F1-Macro and F1-Micro as the metrics. The F1-Macro score is the unweighted mean of the F1 scores calculated per class. The F1-Micro score is the normal F1 formula calculated using the total numbers of true positives (TPs), false positives (FPs) and false negatives (FNs) over all classes, instead of per class.
• Node clustering. In the Brightkite and Gowalla datasets, each node in the geospatial graph corresponds to a location. Users have recorded their visits to different locations in the past. The node-clustering task divides the locations into multiple disjoint groups, such that the locations in the same group are likely to be visited by the same set of users. To evaluate the quality of the clustering results, we propose adjusted purity Q(C), a novel metric introduced in this paper, for the clustering task. Intuitively, the locations visited by the same user are likely to be similar. Hence, we assume that the locations visited by the same user belong to the same cluster. Given n clusters {C_1, C_2, . . ., C_n} and the set of users U, the adjusted purity Q(C) score is defined as:

Q(C) = (1/|U|) ∑_{u ∈ U} max_{1 ≤ i ≤ n} |L(u) ∩ C_i| / |L(u)|,

where L(u) is the set of locations that u has visited and C_i represents the i-th cluster.
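A minimal sketch of adjusted purity, under the (assumed) reading that each user contributes the fraction of their visited locations falling into that user's dominant cluster, averaged over users:

```python
def adjusted_purity(clusters, user_visits):
    """Adjusted purity Q(C) sketch.

    clusters: a list of sets of location ids (disjoint clusters).
    user_visits: a dict mapping each user to the locations they visited.
    For each user, take the fraction of their locations inside their
    dominant cluster; average the fractions over all users.
    """
    loc2cluster = {}
    for ci, cluster in enumerate(clusters):
        for loc in cluster:
            loc2cluster[loc] = ci

    scores = []
    for locs in user_visits.values():
        counts = {}
        for loc in locs:
            ci = loc2cluster[loc]
            counts[ci] = counts.get(ci, 0) + 1
        scores.append(max(counts.values()) / len(locs))
    return sum(scores) / len(scores)
```

A score of 1.0 means every user's visited locations land in a single cluster, matching the intuition that co-visited locations should be grouped together.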
Parameters and Baselines. For all datasets, SE-GCL uses a two-layer HGAT as the encoder and Adam as the model optimizer; the number of training batches is set to 100 and the dimension of the MLP layers is set to 16.
We compare SE-GCL with the following two types of models: (1) supervised graph neural network methods: GCN [23], GAT [24] and GraphSAGE [25]; (2) graph contrastive learning methods: GRACE [10], GCA [14], ProGCL [38], BGRL [12] and GBT [14]. For all baseline methods, we adopt the parameters suggested in their papers. In the classification tasks, we try logistic regression, SVM and random forest as classifiers, and report the best-performing results. The input data are randomly divided into training, testing and validation sets with a ratio of 7:2:1. For all the experiments, except the ablation study on the effect of augmentations, we apply all four augmentation methods to the graph. The ratio of E-RRW is set to 0.2. The ratios of C-NP, C-EP and C-FM are set to 0.3.

Overall Evaluation
Table 2 shows the adjusted purity Q(C) of all comparison unsupervised representation methods in the node-clustering task on Gowalla and Brightkite. We observe that SE-GCL achieves the best performance on both datasets. Specifically, it is 21.03% and 13.33% higher than the baselines on Gowalla and Brightkite on average, respectively. SE-GCL is higher than the graph contrastive learning methods GRACE, GCA and ProGCL by 23.15%, 18.49% and 20.24% on the Gowalla dataset, and by 13.01%, 11.73% and 13.95% on the Brightkite dataset, respectively. Compared with negative-example-free graph contrastive learning models, SE-GCL is 18.66% and 24.59% higher than BGRL and GBT on the Gowalla dataset, and 12.3% and 15.65% higher on the Brightkite dataset, respectively. Existing contrastive learning methods only consider the structural neighbors and ignore the semantic information; hence, they perform worse than the proposed method. The results therefore demonstrate the superiority of our proposed approach in the node-clustering task.

Table 3 shows the results of the node classification task on the Nanjing POI dataset. We observe that the graph neural network models perform worse than the graph contrastive learning models. GCN is the worst method, and BGRL is the best baseline in the classification task. The proposed SE-GCL achieves the best performance. It is 3.08% and 5.54% higher than the best baseline BGRL with regard to F1-Micro and F1-Macro, respectively, and 16.84% and 30.95% higher than the worst baseline GCN with regard to F1-Micro and F1-Macro, respectively.

Figure 7 shows the confusion matrix of SE-GCL on the Nanjing POI dataset. The confusion matrix is an error matrix commonly used to visualize the classification performance of a model, where the value of a diagonal element represents the classification accuracy of a particular category. The larger the value of a diagonal element, the darker the corresponding square in the figure. From Figure 7, we observe that the squares on the diagonal are dark while the other areas are light, which reflects that SE-GCL achieves a good classification result for each node category.

Effects of Data Augmentations
To verify the effectiveness of the proposed data augmentation methods, we conduct extensive experiments with different data augmentation strategies.
Table 4 compares the proposed data augmentations with existing data augmentations in the clustering tasks. We see that the improved data augmentations perform better than the original data augmentations. In the Gowalla dataset, the Q(C) score of E-RRW is 13% higher than that of RRW, C-NP + C-FM + C-EP is 1.93% higher than NP + FM + EP, and E-RRW + C-NP + C-FM + C-EP is 12.6% higher than RRW + NP + FM + EP. In the Brightkite dataset, E-RRW is 5.88% higher than the RRW strategy, C-NP + C-FM + C-EP is 6.46% higher than NP + FM + EP, and E-RRW + C-NP + C-FM + C-EP is 8% higher than RRW + NP + FM + EP.
Table 5 compares the proposed data augmentations with existing data augmentations in the classification tasks. We observe that the F1-Micro and F1-Macro of E-RRW are 3.41% and 4.98% higher than those of RRW, respectively. The F1-Macro of E-RRW + C-NP + C-FM + C-EP is 5.33% higher than that of RRW + NP + FM + EP. This is because E-RRW samples the sub-graphs based on the mixed centrality measure, which preserves the important structural and semantic information in the graph. In contrast, the original RRW samples sub-graphs by random walk, which may break this critical information. In addition, we observe that combining different data augmentations improves the performance of the model, and combining all the proposed data augmentations achieves the best performance. In summary, the proposed mixed centrality measure improves the data augmentation methods, and all data augmentation methods contribute to the improvement in performance.

Effects of Encoding Networks
To verify the effectiveness of the proposed encoding network, we replace HGAT with GAT and show the comparison of their experimental results in Tables 6 and 7.
As can be seen in Tables 6 and 7, HGAT achieves the best performance in all tasks. In the clustering tasks, the Q(C) scores of SE-GCL with the HGAT encoder are 7% and 10% higher than those of SE-GCL with the GAT encoder on the Gowalla and Brightkite datasets, respectively. In the classification tasks, HGAT is 1.75% and 2.79% higher than GAT with regard to F1-Micro and F1-Macro, respectively. This is because the traditional GAT encoder ignores the heterogeneous information in the graph and thus has a relatively limited ability to capture information. These experiments show that, compared to the traditional GAT, the HGAT encoder captures useful information more effectively in the classification and clustering tasks.

Effects of Learning Methods
Traditional contrastive learning methods take the representations of the same node under different views as positive samples. In Section 4.5, we propose to take the representations of the semantic neighbors as well as the same node as positive samples. In this set of experiments, we evaluate the effects of the different learning methods.
Tables 8 and 9 compare the proposed learning method with the traditional method. We observe that the proposed learning method outperforms the traditional learning method in all tasks. In the clustering task, the Q(C) scores of the proposed learning method are 15.58% and 10.63% higher than those of the traditional method on the Gowalla and Brightkite datasets, respectively. In the node classification task, the proposed method is 2.75% and 3.51% higher than the traditional method with regard to F1-Micro and F1-Macro, respectively. These results show that the learning method designed in this paper better explores the relationships between semantically related nodes and thus achieves better performance in downstream tasks.

Performance on Non-Geospatial Graphs
The previous results have shown the effectiveness of our proposed approach in both the node classification task and the node clustering task on geospatial graphs. Our approach is general and can also be applied to non-geospatial graphs. In this subsection, we evaluate the proposed approach on non-geospatial graphs with two tasks: node classification and graph classification.

Node Classification in Non-Geospatial Graphs
Experimental setting. In this set of experiments, we use the F1-score as the evaluation metric. We use five non-geospatial datasets in the node classification tasks. The Cora [39] dataset and the Wiki-CS [40] dataset are citation datasets based on citations between papers and Wikipedia entries, respectively. The Cora dataset covers seven different categories, and each node has 1433 feature dimensions. Wiki-CS covers 10 different categories of data samples, with a total of 11,701 nodes and 216,123 edges. The Amazon-Computers [41] dataset is a co-purchase graph extracted from the Amazon platform, with nodes representing goods and edges representing co-purchase relationships between goods, including 10 different node categories. The Darknet [42] dataset covers eight specific application types in normal and malicious traffic, and we sample 20% of the original dataset in the experiments. The details of these datasets are shown in Table 10.

Overall evaluation. Table 11 shows the experimental results of node classification. We observe that SE-GCL outperforms all baselines on all datasets. First, the graph neural network models (GCN, GAT and GraphSAGE) are on average 7.96%, 4.87%, 9.12% and 6.97% lower than the SE-GCL model on Darknet, WikiCS, Cora and Amazon-Computers, respectively, which shows that SE-GCL is better than the graph neural network models. Second, the performance of SE-GCL is much higher than that of commonly used contrastive learning models. The F1-scores obtained by SE-GCL on the Darknet, WikiCS, Cora and Amazon-Computers datasets are 3.37%, 6.41%, 4.79% and 5.58% higher than those of GCA, respectively. As an improved model of GCA, the ProGCL model achieves the best performance.
into four clusters based on the location distance. Each cluster corresponds to a node in G_p, i.e., A_2 = {A}, B_2 = {B, C, D}, C_2 = {E}, D_2 = {F, G, H}. Based on the definition of the skeleton graph, the node pairs (A_2, B_2), (B_2, C_2) and (C_2, D_2) are connected by structural edges, while (A_2, D_2), (B_2, D_2) and (C_2, D_2) are connected by semantic edges. Given a node v, we refer to the set of nodes connected to v via structural edges as its structural neighbors, and the set of nodes connected to v via semantic edges as its semantic neighbors.

Figure 2 .
Figure 2. An example of a semantic geospatial graph and a skeleton graph. The structural edges are colored black, while the semantic edges are marked as red dashed lines.

Figure 3 .
Figure 3. The proposed framework of the SE-GCL model. The left side shows the overall framework; the right side shows the detailed learning process.

Figure 4 .
Figure 4. E-RRW data augmentation method. The red nodes represent the initial nodes. The yellow nodes represent the first step. The green nodes represent the second step. The orange nodes represent the third step.

Figure 5 .
Figure 5. Centrality-aware node perturbation, feature masking and edge perturbation. The black lines represent structural edges and the red dashed lines represent semantic edges.

Figure 6 .
Figure 6. HGAT: the green and yellow circles represent the structural neighbors and semantic neighbors; the blue circles represent the semantic edges; α_ij^r and α_ij^m represent the attention weights.

4.5. Contrastive Learning
Most of the existing contrastive learning methods [9-11] take the same node under different views as positive examples. However, these methods suffer from an insufficient number of positive examples. To solve this problem, we additionally take the semantic neighbors as positive examples, expanding the set of positive examples.

Figure 7 .
Figure 7. Confusion matrix of node classification on Nanjing POI dataset.

Table 1 .
Summary of dataset statistics.

Table 2 .
Performance of unsupervised representation methods in location clustering.

Table 3 .
Performance of models in location classification.

Table 4 .
Comparison of proposed data augmentations and existing data augmentations in clustering task.

Table 5 .
Comparison of data augmentations in classification task (Nanjing POI dataset).

Table 6 .
Comparison of encoding networks in clustering task.

Table 7 .
Comparison of encoding networks in classification task (Nanjing POI dataset).

Table 8 .
Performance of different learning methods in clustering task.

Table 9 .
Performance of different learning methods in classification task (Nanjing POI dataset).

Table 10 .
Summary of dataset statistics.