Self-Supervised Spatio-Temporal Graph Learning for Point-of-Interest Recommendation

Abstract: As one of the most crucial topics in the recommendation system field, point-of-interest (POI) recommendation aims to recommend potentially interesting POIs to users. Recently, graph neural networks (GNNs) have been successfully used to model interaction and spatio-temporal information in POI recommendations, but the data sparsity of POI recommendations affects the training of GNNs. Although some existing GNN-based POI recommendation approaches try to use social relationships or user attributes to alleviate the data sparsity problem, such auxiliary information is not always available for privacy reasons. Self-supervised learning provides a new idea to alleviate the data sparsity problem, but most existing self-supervised recommendation methods are designed for bi-partite graphs or social graphs, and cannot be directly used in the spatio-temporal graph of POI recommendations. In this paper, we propose a new method named SSTGL to combine self-supervised learning and GNN-based POI recommendation for the first time. SSTGL is empowered with spatio-temporal-aware strategies in the data augmentation and pre-text task stages, respectively, so that it can provide high-quality supervision information by incorporating spatio-temporal prior knowledge. By combining the self-supervised learning objective with the recommendation objective, SSTGL can improve the performance of GNN-based POI recommendations. Extensive experiments on three POI recommendation datasets demonstrate the effectiveness of SSTGL, which performed better than existing mainstream methods.


Introduction
With the development of wireless communication and satellite positioning technologies, location-based services (LBSs) have become widely available in people's daily lives. LBSs can access the geographic coordinates of users through mobile devices and integrate them with other information (e.g., time and user preference) for services. Point-of-interest (POI) recommendation is a representative task in LBSs. Unlike typical recommendations that only consider the historical interaction between users and items, POI recommendations also need to consider the impact of spatio-temporal information [1] (e.g., the location coordinates of the POI and the time of interaction between users and POIs).
Over the last decade, POI recommendation algorithms have evolved from early spatio-temporal-aware matrix factorization techniques [2][3][4][5] to recent approaches based on spatio-temporal graph representation learning [6][7][8][9][10][11]. As the most advanced graph representation learning technique, graph neural networks (GNNs), especially spatio-temporal graph neural networks (STGNNs), have been successfully applied to POI recommendations [9][10][11]. For example, as the first work applying a GNN to POI recommendation, GPR [9] constructed additional POI-POI graphs based on the proximity of interaction times and used a separate module to learn the representation of physical distances. As with other recommendation scenarios, POI recommendations also suffer from severe data sparsity [10].
Since GNN-based models usually rely on topology for message passing, the sparsity of interactions means that many nodes (especially long-tail nodes) cannot learn high-quality representations and are susceptible to interaction noise. To mitigate the impact of data sparsity on GNNs, some works have introduced various auxiliary information, including social relations [10] and user attributes [11]. However, as user privacy is taken more seriously these days, such auxiliary information is not always available. Therefore, there is an urgent need to explore new methods to reduce the impact of data sparsity on GNN-based approaches.
The recent emergence of self-supervised learning techniques offers a new direction for this problem. Through data augmentation and the design of pretext tasks, self-supervised learning can provide additional supervision signals to improve the performance of recommendation algorithms. In fact, some recent works have attempted to combine self-supervised learning with GNNs for other recommendation tasks (e.g., product recommendation [12,13], social recommendation [14], and session-based recommendation [15]). For example, SGL [12] designed three graph structure-based data augmentation methods, and NCL [13] designed graph structure-based and semantic prototype-based pretext tasks. However, these methods are specialized for bi-partite graphs, social graphs, or session-based hypergraphs, and do not consider spatio-temporal information. Thus, there is still a gap in adopting self-supervised learning for POI recommendation algorithms based on spatio-temporal graphs.
In this paper, we explore for the first time how self-supervised learning can be applied to GNN-based POI recommendations and design a general framework named self-supervised spatio-temporal graph learning (SSTGL). Specifically, SSTGL incorporates spatio-temporal prior knowledge in two ways, i.e., via data augmentation and the pre-text task. We first define the temporal and spatial similarity of POIs based on the interaction time between POIs and users as well as the geographical location of POIs. Then, considering that users may be interested in POIs that are similar to the POIs they have interacted with, SSTGL adds some implicit edges between users and POIs based on the similarity function between POIs, which implements spatial-based or temporal-based data augmentation. In addition, SSTGL randomly drops some edges with a certain ratio to alleviate the data sparsity problem. For pre-text tasks, SSTGL uses spatio-temporal similarity to guide the consistency between node representations. Finally, we optimize the pre-text task, together with the recommendation ranking task, to improve the performance of the POI recommendation.
Experiments on three datasets show that our proposed model outperformed existing GNN-based POI recommendation algorithms. The relative improvements in Recall@50 were 6.32%, 13.27%, and 9.68%. In addition, ablation experiments and hyper-parameter experiments further demonstrated the robustness of our model.
The contributions of this paper are summarized as follows:
• To the best of our knowledge, this is the first attempt to design a self-supervised learning-based framework to improve GNN-based POI recommendation algorithms.
• We propose data augmentation strategies and pre-text tasks for the proposed framework, which model spatial or temporal prior knowledge from different perspectives.
• We conducted experiments on three POI recommendation datasets and verified that our model could improve GNN-based POI recommendations and outperform existing state-of-the-art methods.

Related Works
In this section, we review two related fields: point-of-interest (POI) recommendation and self-supervised learning (SSL).

Point-of-Interest Recommendation
Most of the traditional POI recommendation models are based on matrix factorization [2][3][4][5]. LRT [2] focused on the impact of temporal information on POI recommendation, while IRenMF [4] exploited geographic location information to model each location's neighbors at both the geographic location and geographic region levels. In addition, LGLMF [3] used local geographic information to obtain popular locations within the user's primary activity area. STACP [5] considered both geographic and temporal information and studied the behavior of users at different periods. Recently, some methods based on graph neural networks have been proposed [9][10][11]. For example, GPR [9] built the POI-POI graph by connecting adjacent POIs in the interaction sequence of users and POIs, and used an exponential function to measure the physical distance between POIs to learn POI representations. HGMAP [10] additionally constructed a user-user graph based on social relationships, as well as a POI-POI graph based on geographic location, and combined the embeddings of multiple graphs to obtain multiple user preference scores. GEAPR [11] learned user representations with the help of several factors, including user attributes, and used attention mechanisms to achieve interpretable recommendations. Data sparsity poses a great challenge to GNN-based recommendation [12]. Although social information and node attributes can alleviate the data sparsity problem to some extent, this auxiliary information is often not available due to the need to protect user privacy.

Self-Supervised Learning
Recently, as an effective way to alleviate the data sparsity problem, self-supervised learning (SSL) has been widely used in computer vision [16], natural language processing [17], and graph-based tasks, including various GNN-based recommendation tasks [12][13][14][15]. For product recommendation, SGL [12] applied SSL to recommendation tasks by changing the graph structure through dropout and random walk strategies, as well as by maximizing the mutual information (MI) between multiple embeddings of the same node while minimizing the MI between the embeddings of different nodes. NCL [13] proposed structure-based and prototype-based contrastive learning objectives, which were used to improve graph collaborative filtering methods. For social recommendation, MHCN [14] designed a hypergraph convolutional network based on social relations and used hierarchical mutual information maximization to recover connectivity information in the hypergraph convolutional network. For session-based recommendation, S²-DHCN [15] proposed a two-channel hypergraph convolutional network and maximized the mutual information between the learned session representations of both channels. However, these methods are not designed for POI recommendation and do not use spatio-temporal information, which makes them inapplicable to spatio-temporal graphs.

Methodology
In this section, we first give the problem definition for POI recommendation and then introduce our proposed model SSTGL. As illustrated in Figure 1, our method consists of three main components, namely, the graph neural network backbone, the spatio-temporal-aware data augmentation, and the spatio-temporal-aware pre-text task.
In detail, the role of the graph neural network backbone is to learn the node representations from the user-POI graph $G$ and use them for the recommendation loss $\mathcal{L}_{main}$ and the generation of the final recommendation list $P_U$. Spatio-temporal-aware data augmentation aims to generate multiple augmented graphs (i.e., two augmented views of $G$) based on spatio-temporal prior knowledge, and these augmented graphs are also fed into the GNN backbone to generate multiple enhanced node representations. Spatio-temporal-aware pre-text tasks use spatio-temporal similarity to design the optimization objective $\mathcal{L}_{ssl}$ for self-supervised learning, thus providing effective self-supervised information.

Problem Definition and GNN Backbone
In this paper, we modeled POI recommendation as a link prediction problem on a bi-partite graph $G = (V, E)$, where the users $U$ and POIs $P$ form the nodes $V$, and the historical interactions $R$ between them form the edges $E$. Our goal was to find potential edges based on the observed edges and spatio-temporal information.
The GNN backbone module aims to generate the nodes' embeddings $Z_U$ and $Z_P$ by training a GNN function $f(\cdot \mid X_U, X_P)$ with a point-wise or pair-wise loss $\mathcal{L}_{main}$, where $X_U$ and $X_P$ are the nodes' features. Then, this module calculates the similarity scores $\hat{Y}_{UP}$ between the user and POI embeddings and uses the top-$K$ POIs with the highest similarity scores as the recommendation list $P_U$.
In detail, graph neural networks usually use the message-passing mechanism to generate node representations, which consists of two steps. First, given the bi-partite graph $G$, the $(l+1)$-th layer representations of the nodes are updated by an aggregation function $f_{aggregate}$ that aggregates the $l$-th layer representations of their neighbor nodes (Equation (1)). Then, the representations of all $L$ layers are fused by a readout function $f_{readout}$ to generate the final representations. Although any $f_{aggregate}$ and $f_{readout}$ can be used as a GNN backbone, for the sake of fair comparison in subsequent experiments, SSTGL uses the same $f_{aggregate}$ and $f_{readout}$ as the backbone of the self-supervised learning baselines [12,13]. Finally, the similarity scores $\hat{Y}_{UP}$ between the representations of users and POIs are calculated using the inner product and are used to calculate the supervision loss $\mathcal{L}_{main}$, where $O^+$ denotes the positive pairs with interaction and $O^-$ denotes the negative pairs without interaction.
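The two message-passing steps and the inner-product scoring can be sketched as follows. This is only an illustration under stated assumptions: the backbone of [12,13] is taken to be LightGCN-style [23] (symmetric-normalized neighbor averaging with no weight matrices, and a mean over layers as the readout), and the pair-wise loss is taken to be a BPR-style ranking loss. The function names `normalized_adjacency`, `propagate`, and `bpr_loss` are hypothetical and do not come from the paper.

```python
import numpy as np

def normalized_adjacency(R):
    """Symmetric-normalized adjacency of the user-POI bi-partite graph.

    R is an (n_users, n_pois) binary interaction matrix.
    """
    n_u, n_p = R.shape
    A = np.zeros((n_u + n_p, n_u + n_p))
    A[:n_u, n_u:] = R
    A[n_u:, :n_u] = R.T
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def propagate(A_hat, Z0, num_layers):
    """Aggregate neighbors layer by layer, then fuse all layers by their mean."""
    layers = [Z0]
    for _ in range(num_layers):
        layers.append(A_hat @ layers[-1])   # assumed f_aggregate
    return np.mean(layers, axis=0)          # assumed f_readout

def bpr_loss(Z_u, Z_p, users, pos, neg):
    """Pair-wise ranking loss over (user, positive POI, negative POI) triples."""
    pos_scores = np.sum(Z_u[users] * Z_p[pos], axis=1)
    neg_scores = np.sum(Z_u[users] * Z_p[neg], axis=1)
    # -log(sigmoid(x)) equals log(1 + exp(-x)), computed stably by logaddexp.
    return float(np.mean(np.logaddexp(0.0, -(pos_scores - neg_scores))))

# Toy example: 2 users, 3 POIs, 4-dimensional embeddings.
R = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
Z0 = np.random.default_rng(0).normal(size=(5, 4))
Z = propagate(normalized_adjacency(R), Z0, num_layers=2)
Z_u, Z_p = Z[:2], Z[2:]
scores = Z_u @ Z_p.T  # inner-product similarity scores, shape (2, 3)
```

Ranking each row of `scores` in descending order and keeping the first $K$ entries would give the recommendation list for that user.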

Spatio-Temporal-Aware Data Augmentation
Some straightforward data augmentation operators have been proposed in previous work [12], including node dropout, edge dropout, and random walk. However, these operators ignore the effect of spatio-temporal information, thus resulting in some low-quality graph structures, which in turn reduce the performance of POI recommendations.
To take spatio-temporal information into account during data augmentation, in this section, we first defined the POI's temporal and spatial similarity matrices separately. Based on these similarity matrices, we defined the spatial-aware edge perturbation and temporal-aware edge perturbation operations. SSTGL uses these edge perturbation methods randomly at each epoch of training to generate new augmented graphs.
In detail, based on Equation (1), the aggregation operation is applied to the graphs produced by the data augmentation operators $s_1(\cdot)$ and $s_2(\cdot)$. Note that, although data augmentation can generate any number of views, SSTGL uses only two data augmentation operations to reduce the model complexity.
The spatial and temporal similarity matrices of the POIs are defined as follows:
• Spatial similarity matrix $M_S \in \{0, 1\}^{|P| \times |P|}$: when the distance between two POIs is less than a certain threshold $K_S$, the similarity of these two POIs is 1; otherwise, it is 0.
• Temporal similarity matrix $M_T \in \{0, 1\}^{|P| \times |P|}$: when two POIs have interacted with the same user within a period $K_T$, the similarity of these two POIs is 1; otherwise, it is 0.
In practice, we set $K_T$ to 2 h, following previous work [18], which discovered that users tended to visit the same POI consecutively within 2 h. In addition, we used the Geohash algorithm to transform the geographic coordinates of the POIs into region IDs and treated two POIs as satisfying the threshold condition $K_S$ when their region IDs were the same.
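As an illustration of these definitions, the two binary similarity matrices could be built as in the sketch below. It assumes that region IDs (e.g., Geohash cells) have already been computed for each POI and that check-ins are given as (user, POI index, Unix timestamp) triples; the function names and the inclusive time comparison are assumptions of this sketch rather than details given in the text.

```python
from collections import defaultdict

import numpy as np

def spatial_similarity(region_ids):
    """M_S[i, j] = 1 when POIs i and j fall into the same region ID."""
    ids = np.asarray(region_ids)
    return (ids[:, None] == ids[None, :]).astype(np.int8)

def temporal_similarity(checkins, n_pois, k_t_seconds=2 * 3600):
    """M_T[i, j] = 1 when one user visited POIs i and j within k_t_seconds."""
    M = np.zeros((n_pois, n_pois), dtype=np.int8)
    by_user = defaultdict(list)
    for user, poi, ts in checkins:
        by_user[user].append((ts, poi))
    for visits in by_user.values():
        visits.sort()  # chronological order for this user
        for i, (t_i, p_i) in enumerate(visits):
            for t_j, p_j in visits[i:]:
                if t_j - t_i > k_t_seconds:
                    break  # later visits are even further apart in time
                M[p_i, p_j] = M[p_j, p_i] = 1
    return M

# Toy data: POIs 0 and 1 share a region; user 0 visits POIs 0 and 1 one hour apart.
M_S = spatial_similarity(["wx4g0", "wx4g0", "wx4g1"])
M_T = temporal_similarity(
    [(0, 0, 0), (0, 1, 3600), (0, 2, 20000), (1, 2, 0)], n_pois=3
)
```

For large POI sets, a sparse matrix would be preferable to the dense arrays used here.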
Based on the spatial and temporal similarity matrices, we defined the following data augmentation operators:
• Spatial-aware edge perturbation (SEP): It adds multiple implicit edges based on the spatial similarity to the original user-POI edges, where $\Theta_1(a, b)$ and $\Theta_2(a, b)$ are perturbation vectors that pick elements from $a$ at some ratio $r_\theta$ and add them to $b$.
• Temporal-aware edge perturbation (TEP): It adds multiple implicit edges based on the temporal similarity to the original user-POI edges.
Following [12], we used the above data augmentation approaches at each epoch to generate multiple views and used the same ratio $r_\theta$ on $\Theta_1$ and $\Theta_2$. We leave it as future work to use more than two data augmentation operators $s(\cdot)$ simultaneously and to use different perturbation ratios $r_\theta$ for different operators.
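A minimal sketch of such an edge perturbation is given below. It assumes that candidate implicit edges connect a user to any not-yet-visited POI that is similar (under a binary similarity matrix) to a POI the user has visited, and that a fraction `ratio` of these candidates is sampled uniformly at random. The candidate rule, the sampling scheme, and the name `edge_perturbation` are assumptions of this sketch, not the paper's exact operator.

```python
import numpy as np

def edge_perturbation(R, M, ratio, rng):
    """Add a random fraction `ratio` of similarity-induced implicit edges to R.

    R: (n_users, n_pois) binary interaction matrix.
    M: (n_pois, n_pois) binary POI similarity matrix (spatial or temporal).
    """
    # A candidate edge links a user to an unvisited POI similar to a visited one.
    candidates = ((R @ M) > 0) & (R == 0)
    users, pois = np.nonzero(candidates)
    n_pick = int(round(ratio * len(users)))
    picked = rng.permutation(len(users))[:n_pick]
    R_aug = R.copy()
    R_aug[users[picked], pois[picked]] = 1
    return R_aug

R = np.array([[1, 0, 0], [0, 0, 1]])
M_S = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
rng = np.random.default_rng(0)
view_1 = edge_perturbation(R, M_S, ratio=1.0, rng=rng)  # adds the edge (user 0, POI 1)
view_2 = edge_perturbation(R, M_S, ratio=0.0, rng=rng)  # leaves R unchanged
```

Passing a temporal similarity matrix instead of `M_S` gives the temporal-aware counterpart under the same assumptions.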

Spatio-Temporal-Aware Pre-Text Task
For contrastive self-supervised learning, the pre-text tasks $\mathcal{L}_{ssl}$ are defined based on positive and negative pairs. Existing methods usually treat representations of the same node with different augmentation methods [12] or with different GNN layers [13] as positive pairs, and treat different nodes using different augmentation methods [12] or different GNN layers [13] as negative pairs; neither of these approaches takes into account the spatio-temporal information. In this section, we propose spatial-aware and temporal-aware pre-text tasks, which are further used to define spatial-aware and temporal-aware contrastive learning objectives.
In detail, we believe that spatio-temporal prior knowledge gives some clues for selecting positive and negative samples. For example, POIs with similar spatio-temporal properties are more suitable as positive samples than as negative samples. Due to data sparsity, these similar POIs may not have interacted with the same user and could easily be mistaken for negative pairs, thus reducing the recommendation performance. Therefore, we designed a spatial-aware pre-text task (SPT) and a temporal-aware pre-text task (TPT), which select positive and negative examples for each target node according to the similarity matrices $M_S$ and $M_T$, respectively. Based on these pre-text tasks, we proposed the spatial-aware and temporal-aware contrastive learning objectives:
• Spatial-aware contrastive learning (SCL): Maximizes the MI between spatial-aware positive POI pairs and minimizes the MI between spatial-aware negative POI pairs, where $s(\cdot)$ is a similarity measure function, which is set as a dot product; $\tau$ is the temperature in softmax; and $(\cdot)_1$ and $(\cdot)_2$ represent different embeddings obtained from data augmentation.
• Temporal-aware contrastive learning (TCL): Maximizes the MI between temporal-aware positive POI pairs and minimizes the MI between temporal-aware negative POI pairs.
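The objectives above can be illustrated with an InfoNCE-style loss in which a binary similarity matrix decides which POIs may serve as negatives. The sketch assumes one positive per target (the same POI in the other augmented view), negatives drawn at random from POIs whose similarity to the target is 0, a sampling fraction `rho` in [0, 1], and dot-product similarity with temperature `tau`. The function name and these sampling details are assumptions of the sketch.

```python
import numpy as np

def similarity_aware_infonce(Z1, Z2, M, rho, tau, rng):
    """Contrastive loss between two views, with negatives restricted by M.

    Z1, Z2: (n_pois, d) POI embeddings from two augmented views.
    M:      (n_pois, n_pois) binary similarity matrix (spatial or temporal).
    """
    losses = []
    for i in range(Z1.shape[0]):
        # Only POIs that are NOT similar to the target may act as negatives.
        dissimilar = np.flatnonzero(M[i] == 0)
        dissimilar = dissimilar[dissimilar != i]
        n_neg = max(1, int(rho * len(dissimilar))) if len(dissimilar) else 0
        negatives = rng.permutation(dissimilar)[:n_neg]
        pos_logit = Z1[i] @ Z2[i] / tau
        neg_logits = Z1[i] @ Z2[negatives].T / tau
        logits = np.concatenate(([pos_logit], neg_logits))
        # Negative log-softmax probability of the positive pair.
        losses.append(np.logaddexp.reduce(logits) - pos_logit)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
Z = 5.0 * np.eye(3)  # three well-separated toy POI embeddings
loss = similarity_aware_infonce(Z, Z, M=np.eye(3), rho=1.0, tau=1.0, rng=rng)
```

Because spatially or temporally similar POIs never enter the denominator, they are not pushed apart, which reflects the motivation stated above.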
In practice, we set $Q_S$ and $Q_T$ to 1; that is, SSTGL only used the target node as a positive example. We leave larger or different values of $Q_S$ and $Q_T$ to future work. In addition, since SSTGL did not define a user-based spatio-temporal similarity matrix, it used a spatio-temporal-independent contrastive learning objective for users, as in [12]. Using the POI-based and user-based self-supervised objective functions, we defined the final self-supervised optimization objective $\mathcal{L}_{ssl}$ as a combination of the two, where $\alpha$ is a hyper-parameter used to balance the two losses, and the POI-based loss $\mathcal{L}_P \in \{\mathcal{L}_P^S, \mathcal{L}_P^T\}$.

Model Training
To use self-supervised signals to improve the performance of POI recommendations, SSTGL uses a multi-task training strategy to optimize the ranking loss and the contrastive learning loss jointly:
$$\mathcal{L} = \mathcal{L}_{main} + \lambda_1 \mathcal{L}_{ssl} + \lambda_2 \|\Phi\|_2^2,$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters used to control the strengths of the SSL and regularization terms, respectively, and $\Phi$ denotes the set of GNN parameters.

Complexity Analyses of SSTGL
We show the algorithm in Algorithm 1. Since SSTGL introduces no trainable parameters, its space complexity remains the same as that of the GNN backbone. In addition, without any change to the neural network structure, its inference time complexity also remains the same as that of the GNN backbone. Therefore, the main extra time complexity comes from computing the self-supervised loss. Let $|E|$ and $|V|$ be the number of edges and nodes in the user-item graph, respectively, $s$ denote the number of epochs, $B$ denote the batch size, and $d$ denote the embedding size. The time complexity of the whole training phase is $O(|E|d(2 + |V|s))$, which is the same as that of an existing self-supervised GNN model [12].

Algorithm 1 The framework of SSTGL
Require: The original user-POI graph $G = (V, E)$; the number of layers $L$ of the GNN model.
Ensure: The similarity scores $\hat{Y}_{UP}$.
1: Initialize all node embeddings $Z^{(0)}$ and compute similarity scores by Equation (7);
2: while not converged do
3: for each layer $l$ in $[0, \ldots, L-1]$ do

For each dataset, we chose the oldest 70% of interactions of each user as the training data and the newest 20% of interactions as the test data. The remaining 10% were used as validation data.
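The per-user chronological split can be sketched as follows, assuming check-ins are (user, POI, timestamp) triples. The rounding rule for users whose interaction counts are not multiples of ten is not specified in the text, so the one used here, like the name `chronological_split`, is an assumption.

```python
from collections import defaultdict

def chronological_split(checkins, train_frac=0.7, test_frac=0.2):
    """Split each user's check-ins by time: oldest -> train, newest -> test.

    Whatever lies between the two blocks is used for validation.
    """
    by_user = defaultdict(list)
    for user, poi, ts in checkins:
        by_user[user].append((ts, poi))
    train, valid, test = [], [], []
    for user, visits in by_user.items():
        visits.sort()  # oldest first
        n = len(visits)
        n_train = round(n * train_frac)
        n_test = min(round(n * test_frac), n - n_train)
        train += [(user, poi) for _, poi in visits[:n_train]]
        valid += [(user, poi) for _, poi in visits[n_train:n - n_test]]
        test += [(user, poi) for _, poi in visits[n - n_test:]]
    return train, valid, test

# One user with ten check-ins at timestamps 0..9 (POI index equals timestamp).
toy = [(0, t, t) for t in range(10)]
train, valid, test = chronological_split(toy)
```

Splitting by time rather than at random keeps the evaluation from using future interactions to predict past ones.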

Baselines
We compared SSTGL with the following ten models, which can be classified according to Table 2:
• NeuMF [20]: NeuMF is a classical MF-based model that combines matrix factorization and multi-layer perceptron to learn both low-dimensional and high-dimensional embeddings.
• NGCF [21]: NGCF is a GNN-based model capturing high-order information through message passing and aggregation.
• DGCF [22]: DGCF is a GNN-based model, which models different relationships and separates user intents in the representation.
• LightGCN [23]: LightGCN is a GNN-based recommendation model, which simplifies the aggregation step by deleting the weight matrix and activation function.
• SGL [12]: SGL is a graph-based self-supervised method that proposes three data augmentation strategies based on the graph structure.
• NCL [13]: NCL is a graph-based contrastive learning method that improves neural graph collaborative filtering by considering structural and semantic neighbors.
• LGLMF [3]: LGLMF is an MF-based POI recommendation model, which combines logistic matrix factorization with a region-based geographical model.
• STACP [5]: STACP is also an MF-based POI recommendation model, which combines matrix factorization with a spatio-temporal activity-centers algorithm.
• GPR [9]: GPR is a GNN-based model designed for POI recommendation that uses an extra POI-POI graph to learn item embeddings and improve performance.
• MPGRec [24]: MPGRec is the newest GNN-based POI recommendation model, which uses a dynamic memory module to store global information for spatial consistency.
Table 2. Category of baselines and SSTGL.

Evaluation Metrics
To evaluate the model's performance, we adopted two general metrics, Recall@N and mean average precision (MAP@N), where N means the top-N POIs recommended by the model. In our experiments, N was set to 5, 10, 20, and 50 for a comprehensive comparison.
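For reference, the two metrics can be computed for a single user as in the sketch below; averaging over all test users gives the reported values. Several normalizations of average precision are in use, and the one chosen here (dividing by the smaller of N and the number of relevant POIs) is an assumption, as are the function names.

```python
def recall_at_n(ranked, relevant, n):
    """Fraction of a user's held-out POIs that appear in the top-n list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:n]) & relevant) / len(relevant)

def average_precision_at_n(ranked, relevant, n):
    """Precision averaged over the ranks at which relevant POIs appear."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for rank, poi in enumerate(ranked[:n], start=1):
        if poi in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), n)
    return score / denom if denom else 0.0

ranked = [3, 1, 2, 5]   # POIs ordered by predicted score
relevant = {1, 5}       # POIs the user actually visited in the test period
r3 = recall_at_n(ranked, relevant, 3)
ap3 = average_precision_at_n(ranked, relevant, 3)
ap4 = average_precision_at_n(ranked, relevant, 4)
```

MAP@N is then the mean of `average_precision_at_n` over users.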

Performance Comparison (RQ1)
The experimental results on the three datasets are shown in Tables 3, 4, and 5, respectively. Note that there were four different variants of SSTGL depending on the choice of the data augmentation and the pre-text task. In detail, we used edge dropout [12] as the non-spatio-temporal data augmentation, and we defined the non-spatio-temporal pre-text task of the POI in a similar way to Equation (14). We present the best results among these variants as the performance of SSTGL in the tables. The differences in performance between the variants are analyzed in the next section.
Based on the experimental results, we have the following observations:
• SSTGL outperformed all baseline methods in most cases. In particular, the relative improvements over the strongest baselines were 6.32% (Foursquare), 13.27% (Gowalla), and 9.68% (Meituan) using the Recall@50 metric. Note that SSTGL not only worked better than the existing POI recommendation methods, but also better than the existing self-supervised graph learning methods. This demonstrates the ability of our model to use self-supervised learning to alleviate the data sparsity problem in the POI recommendation task. Although MPGRec performed better in some cases, it relies on a dynamic memory module, which requires a large memory overhead.
• For the baseline models, the GNN models did not always outperform the MF models, which was related to the datasets and model architectures. For example, we found that the NeuMF model performed better than some GNN-based methods for the Meituan dataset. This may be due to the low sparsity of the Meituan dataset and the more personalized interests of the users in the take-out scenario, so aggregating higher-order neighborhood information would instead reduce the performance.

Ablation Study (RQ2)
To explore the effects of different data augmentation and pre-text tasks on the model performance, we show the results of four model variants in Figure 2. Based on the results, we have the following observations:
• Among the strategies we designed, the temporal-aware approaches (i.e., TEP and TCL) worked better on the Foursquare and Gowalla datasets, and the spatial-aware approaches (i.e., SEP and SCL) performed better on the Meituan dataset. This may indicate that the spatial factor has a greater influence on the Meituan dataset compared with the other datasets.
• Although both consider spatio-temporal information, the data augmentation-based approach outperformed the pre-text-task-based approach. This may be because data augmentation directly modifies the graph structure, which allows the node representations output by the GNN to make better use of spatio-temporal prior knowledge.

Influence of Hyper-Parameters (RQ3)
Our model contained three main hyper-parameters (i.e., the drop ratio $r_\theta$, the sample ratio $\rho$, and the temperature $\tau$). In Figure 3, we show the performance of the model for the Foursquare dataset with different values of the hyper-parameters. We have the following observations:
• Overall, different values of the drop ratio $r_\theta$ and the sample ratio $\rho$ had little effect on the model results, which indicates the robustness of the model.
• SSL temperature $\tau$ values that were too large or too small reduced the performance. This observation is consistent with previous work [12]. A possible reason is that, if the temperature is large, it is more difficult to distinguish negative examples; if the temperature is small, only a small number of negative examples affect the optimization.
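The temperature effect described above can be seen in a small numerical sketch. It computes the softmax weights that three hypothetical negatives would receive at different temperatures; the similarity values and the name `negative_weights` are illustrative assumptions, not results from the experiments.

```python
import numpy as np

def negative_weights(similarities, tau):
    """Softmax weights over negatives for a given temperature tau."""
    logits = np.asarray(similarities, dtype=float) / tau
    logits -= logits.max()  # subtract the maximum for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

sims = [0.9, 0.5, 0.1]  # hypothetical similarities of three negatives to the anchor
sharp = negative_weights(sims, tau=0.05)  # almost all weight on the hardest negative
flat = negative_weights(sims, tau=10.0)   # nearly uniform weights
```

With a small temperature, the most similar negative dominates the softmax, whereas with a large temperature all negatives contribute almost equally and are hard to tell apart.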

Discussion
Based on our experiments, we found that existing self-supervised learning strategies did not always improve the GNN models. For example, SGL performed worse than LightGCN in most cases. This indicates that data augmentation approaches that do not take into account the spatio-temporal information may not help POI recommendations to alleviate the data sparsity problem. In contrast, our model took spatio-temporal information into account when designing the self-supervised method, so the performance was improved.
In addition, we were surprised to find that the spatio-temporal approaches did not always outperform the non-spatio-temporal models. In particular, STACP and LGLMF performed extremely poorly on the Meituan dataset, which may be because these two methods rely on the geographic coordinates of the POI, while the Meituan dataset only has the region ID of the POI. Since POI recommendation algorithms are often not compared with non-spatio-temporal recommendation algorithms, many works overlook the fact that recommendation algorithms based on bi-partite graphs can also be very strong baselines. In practical applications, this also suggests fusing spatio-temporal information into existing non-spatio-temporal GNNs (e.g., LightGCN) for POI recommendation, which may be more effective and efficient than designing completely new spatio-temporal GNNs (e.g., GPR).

Conclusions
In this paper, we proposed a novel self-supervised spatio-temporal graph learning model (SSTGL) to improve the GNN's potential for use in POI recommendations. In particular, we designed a model-agnostic self-supervised learning framework that took into account spatio-temporal prior knowledge. Based on the framework, we defined spatio-temporal-aware data augmentation and pre-text tasks. Extensive experiments showed that SSTGL outperformed existing methods for the POI recommendation task.
In future work, we will explore more views and more flexible data augmentation strategies under the self-supervised learning framework, and we will try to apply graph-based self-supervised learning to more complex tasks, including next POI recommendation and tour recommendation.
• Spatial-aware pre-text task (SPT): We took $Q_S$ nodes with the highest spatial similarity in $M_S$ to the target node as positive examples and $\rho$(%) nodes with the lowest spatial similarity as negative examples.
• Temporal-aware pre-text task (TPT): We took $Q_T$ nodes with the highest temporal similarity in $M_T$ to the target node as positive examples and $\rho$(%) nodes with the lowest temporal similarity as negative examples.

end while

Experiments
To verify the validity of SSTGL and explore the reasons behind it, in this section, we conducted extensive experiments to answer research questions (RQs) on the overall performance (RQ1), the contribution of each component (RQ2), and the influence of hyper-parameters (RQ3). The datasets include the following:
• Foursquare [19]: The Foursquare dataset consists of check-in data generated on Foursquare from April 2012 to September 2013. Following [19], we removed users with fewer than 10 interactions and POIs with fewer than 10 interactions. After preprocessing, it contained 1,196,248 check-ins between 24,941 users and 28,593 POIs.
• Gowalla [19]: The Gowalla dataset consists of check-in data generated on Gowalla from February 2009 to October 2010. As was done in [19], we removed users with fewer than 15 interactions and POIs with fewer than 10 interactions. After preprocessing, it contained 1,278,274 check-ins between 18,737 users and 32,510 POIs.

Table 3. Overall performance of SSTGL and all baselines for the Foursquare dataset, where the best results for each column are shown in bold font, and the second-place results are underlined.

Table 4. Overall performance of SSTGL and all baselines for the Gowalla dataset, where the best results for each column are shown in bold font, and the second-place results are underlined.

Table 5. Overall performance of SSTGL and all baselines for the Meituan dataset, where the best results for each column are shown in bold font, and the second-place results are underlined.