Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation

Bu, Chong; Liu, Yujie; Lu, Jing; Huang, Manqi; Li, Maoyi; Li, Jiarui

doi:10.3390/bdcc9120322

Open AccessArticle

Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation

by

Chong Bu

¹,

Yujie Liu

^1,*,

Jing Lu

²,

Manqi Huang

¹,

Maoyi Li

¹ and

Jiarui Li

¹

School of Computer Science, Chengdu College of University of Electronic Science and Technology of China, Chengdu 611731, China

²

School of Network and Communication Engineering, Chengdu Technological University, Chengdu 611730, China

^*

Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2025, 9(12), 322; https://doi.org/10.3390/bdcc9120322

Submission received: 28 October 2025 / Revised: 3 December 2025 / Accepted: 8 December 2025 / Published: 15 December 2025

(This article belongs to the Topic Graph Neural Networks and Learning Systems)

Download

Browse Figures

Versions Notes

Abstract

Point-of-Interest (POI) recommendation predicts users’ future check-ins based on their historical trajectories and plays a key role in location-based services (LBS). Traditional approaches such as collaborative filtering and matrix factorization model user–POI interaction matrices fail to fully leverage spatio-temporal information and semantic attributes, leading to weak performance on sparse and long-tail POIs. Recently, Graph Neural Networks (GNNs) have been applied by constructing heterogeneous user–POI graphs to capture high-order relations. However, they still struggle to effectively integrate spatio-temporal and semantic information and enhance the discriminative power of learned representations. To overcome these issues, we propose Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation (S²DCRec), a novel framework integrating spatio-temporal and semantic information. It employs hierarchical relational encoding to capture fine-grained behavioral patterns and high-level semantic dependencies. The model jointly captures user–POI interactions, temporal dynamics, and semantic correlations in a unified framework. Furthermore, our alignment strategy ensures micro-level collaborative and spatio-temporal consistency and macro-level semantic coherence, enabling fine-grained embedding fusion and interpretable contrastive learning. Experiments on real-world datasets, Foursquare NYC, and Yelp, show that S²DCRec outperforms all baselines, improving F1 scores by 4.04% and 3.01%, respectively. These results demonstrate the effectiveness of the dual-channel design in capturing both sequential and semantic dependencies for accurate POI recommendation.

Keywords:

POI recommendation; graph neural networks; spatio-temporal modeling; semantic representation; dual-channel contrastive alignment

1. Introduction

Point-of-Interest (POI) recommendation plays a vital role in location-based services (LBS), enabling users to efficiently discover potential destinations based on their historical trajectories. Among traditional POI recommendation techniques, Collaborative Filtering (CF) stands out as one of the most classical approaches [1]. By analyzing user–POI interaction histories, CF uncovers similarities among users or POIs to predict user interests. This method is simple to implement, offers strong interpretability, and performs effectively in data-rich scenarios by capturing collaborative patterns. However, CF suffers from several inherent limitations: interaction data are often sparse, with many users or POIs lacking sufficient histories, which undermines generalization in cold-start and long-tail settings. Moreover, CF neglects rich contextual information such as temporal patterns, geographical constraints, and semantic attributes of POIs—factors that play critical roles in shaping user mobility and preference decisions [2]. Consequently, a pressing challenge in POI recommendation lies in integrating spatio-temporal features and semantic correlations to improve accuracy.

To address these limitations, Graph Neural Networks (GNNs) have emerged as an effective approach in recommendation systems [3,4]. GNNs excel at modeling high-order dependencies between nodes and their neighbors in graph-structured data. Through iterative message passing, GNNs aggregate multi-hop neighborhood features to learn expressive user and POI representations [5,6]. This capability makes GNNs particularly well-suited for modeling complex user–POI–context interaction networks. They help alleviate sparsity and cold-start issues while capturing spatio-temporal dynamics and semantic correlations to enhance both accuracy and interpretability. Nevertheless, applying GNNs to POI recommendation remains challenging. First, spatio-temporal dynamics lead to rapid changes in user interests over time and across locations, which static graph modeling fails to fully capture. Second, POI semantic attributes and contextual features are inherently high-dimensional and structurally diverse, making it difficult to align them effectively with user behavior sequences. Third, deep GNN architectures are prone to over-smoothing, where representations of distinct users or POIs converge and lose discriminative power [7]. Hence, designing GNN-based methods that can jointly model spatio-temporal dependencies and semantic attributes, while maintaining both efficiency and interpretability, remains a central challenge in POI recommendation research.

Although GNN-based methods have achieved notable success in POI recommendation, most existing approaches remain confined to single-perspective modeling. Some methods focus primarily on structural relations in user–POI interaction graphs but fail to capture temporal and spatial dynamics effectively [8,9]. Others emphasize semantic integration while neglecting the profound impact of spatio-temporal signals on user mobility behavior [10]. Moreover, discrepancies and semantic gaps across heterogeneous feature channels hinder effective alignment of multimodal information within a unified representation space [11,12,13].

In particular, recent contrastive learning-based recommender models (e.g., LightGCL [5], MDGCL [14]) predominantly operate under a single-view or dual-view setting, focusing on either interaction–semantic or interaction–temporal alignment. However, these existing methods do not explicitly model both spatio-temporal and semantic dependencies simultaneously in a unified, interaction-aware manner. Similarly, transformer–graph hybrid approaches (e.g., TransGNN [15]) and trajectory-based next POI prediction models (e.g., STP-Rec [16], GETNext [17]) integrate multi-modal signals to some extent but still lack a unified dual-channel contrastive alignment mechanism that captures both local sequential dynamics (micro-level) and global semantic correlations (macro-level) within a single optimization objective. Specifically, existing approaches do not leverage a dual-channel contrastive alignment strategy where spatio-temporal embeddings and semantic representations are jointly optimized to enhance representation diversity and mitigate over-smoothing. In contrast, our method explicitly introduces complementary contrastive learning objectives at both micro and macro levels, enabling the model to capture fine-grained behavioral shifts while embedding higher-order semantic dependencies, thus clearly differentiating our approach from prior multi-view GNN-contrastive methods.

Research Gap and Contribution.Although recent contrastive learning-based approaches have enhanced graph representations [5,6,18], most operate under a single-view or loosely coupled dual-view paradigm. These methods mainly reinforce local structures or modality-specific information, but fail to model coordinated interactions between spatio-temporal dynamics and semantic knowledge. Existing hybrid GNN–contrastive models (e.g., LightGCL [5], TransGNN [15], STP-Rec [16]) typically align only two perspectives, limiting representation diversity and potentially causing one view to dominate. To address these limitations, we propose a dual-channel contrastive alignment framework. Micro-level objectives refine short-term mobility patterns, while macro-level alignment consolidates semantic and long-range correlations. The two channels are jointly optimized to provide complementary contextual signals, improving representation quality across heterogeneous modalities. Theoretically, dual-channel alignment is superior because it explicitly captures both sequential dependencies and semantic structures in a unified optimization framework, mitigating over-smoothing and preserving view-specific diversity. To our knowledge, this is the first work to integrate spatio-temporal and multidimensional semantic knowledge through coordinated dual-channel contrastive learning, resulting in stronger generalization for POI recommendation.

To overcome these challenges, we propose a novel framework: Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation (S²DCRec). Unlike existing approaches, our framework jointly models spatio-temporal dynamics and multi-dimensional semantic dependencies within a unified contrastive optimization scheme, where micro-level alignment refines sequential mobility patterns and macro-level alignment consolidates high-order semantic abstraction across POIs. Building upon recent advances in contrastive representation learning [5,19], dynamic graph reasoning [20,21], we construct an integrated architecture that effectively captures both local and global dependencies for robust POI recommendation.

The proposed framework consists of two core components. First, the Multi-Graphs Embedding Module constructs three complementary graph views: collaborative interaction graphs, spatio-temporal graphs, and semantic knowledge graphs. Each graph view is processed using customized GNN encoders to generate high-quality embeddings, capturing complex topological structures and multi-scale neighborhood dependencies. This design ensures effective aggregation of nodes and preserves both local and global structural information. Second, the Contrastive Alignment on Spatio-Temporal and Semantic Dual-Channel performs fine-grained contrastive learning in two stages. At the micro level, collaborative interaction embeddings are aligned with spatio-temporal embeddings to capture short-term and local dependencies. At the macro level, these embeddings are further integrated with semantic embeddings to consolidate high-order semantic and structural correlations among POIs. This dual-channel contrastive alignment enables the model to learn complementary contextual signals from both sequential and semantic perspectives, improving representation quality for POI recommendation.

Our contributions in this work can be summarized as follows.

We emphasize the importance of jointly modeling spatio-temporal dynamics and semantic correlations in POI recommendation. With the design of a dual-channel contrastive alignment model, this work provides a novel perspective for effectively incorporating heterogeneous contextual signals into graph-based recommender systems;
We propose a new model, S²DCRec, that adopts a dual-channel design to integrate spatio-temporal and semantic information. Specifically, we construct a multi-graph embedding module where customized GNN encoders generate complementary representations from collaborative, spatio-temporal, and semantic graphs. Moreover, a dual-channel contrastive alignment mechanism is designed to fuse micro-level spatio-temporal dependencies with macro-level high-order semantic structures, achieving cross-channel enhancement and information complementarity;
Extensive experiments are conducted on two benchmark datasets, Foursquare NYC and Yelp. The results demonstrate that S²DCRec significantly outperforms state-of-the-art baselines, validating its effectiveness in POI recommendation.

2. Related Work

Point-of-Interest (POI) recommendation has become a fundamental task in location-based service systems, aiming to predict users’ next possible check-ins based on historical trajectories. The research paradigm has evolved from traditional collaborative filtering to graph neural networks and, more recently, spatio-temporal and contrastive learning. Each paradigm differs in its capability to model user preferences, spatio-temporal dynamics, and semantic dependencies. In this section, we review representative works from three major perspectives.

2.1. Traditional POI Recommendation Methods

Early studies on POI recommendation primarily relied on collaborative filtering (CF) and matrix factorization techniques. Bayesian Personalized Ranking Matrix Factorization (BPRMF) [1] remains one of the most influential models, optimizing pairwise ranking loss to capture user–POI interaction patterns. CF-based approaches are simple, interpretable, and effective in dense datasets by leveraging shared behavioral similarity across users. However, their limitations are evident in modern large-scale scenarios. CF methods suffer from data sparsity, cold-start problems, and long-tail effects, where insufficient interactions prevent reliable preference estimation. Moreover, they fail to exploit contextual cues such as time, geography, and semantics-factors critical to user mobility and decision-making [3,4,8]. Consequently, CF-based methods form the foundation of POI recommendation but remain inadequate for complex real-world environments. To address these issues, deep learning–based approaches such as Neural Collaborative Filtering (NCF) [3] and TALLRec [4] employ multilayer perceptrons and recurrent networks to model high-order nonlinear user–POI interactions. These models learn latent behavioral representations automatically, significantly improving personalization accuracy. Yet, they still underexploit spatial and temporal dependencies, and their performance degrades under sparse or dynamic contexts, highlighting the need for context-aware modeling.

2.2. Graph-Based POI Recommendation

With the rise of graph neural networks (GNNs), researchers began representing user–POI interactions and knowledge structures as graphs, where nodes denote entities (users or POIs) and edges capture relations such as co-visitation or semantic linkage. LightGCN [8] simplifies traditional GCNs by removing nonlinear transformations, achieving efficient high-order neighbor propagation and strong performance in recommendation tasks. However, LightGCN primarily focuses on collaborative interactions and does not jointly model spatio-temporal and semantic dependencies. Our framework explicitly integrates both, addressing this limitation. Meanwhile, several works have extended GNN-based modeling with contextual enhancements, such as hierarchical context modeling [22], ultra-simplified aggregation [9], trajectory-enhanced transformers [17], and dynamic graph modeling [20,21]. Although these approaches improve context modeling or spatio-temporal fusion, they cannot simultaneously capture micro-level sequential patterns and macro-level semantic correlations. Our dual-channel contrastive alignment unifies these aspects. Knowledge-graph-enhanced methods such as KGRec [23] integrate relational and semantic dependencies between POIs, enriching representation quality and improving explainability. KGRec focuses on semantic relations but lacks explicit spatio-temporal modeling. In contrast, our method jointly optimizes spatio-temporal and semantic embeddings to provide more comprehensive and context-aware representations. Transformer–GNN hybrid approaches such as TransGNN [20] further enhance long-range dependency modeling by coupling Transformers with GNNs, enabling adaptive context aggregation across heterogeneous user–POI graph layers. While TransGNN captures global structural dependencies, it does not enforce local sequential alignment. Our dual-channel framework combines micro-level alignment for sequential dynamics with macro-level semantic alignment, bridging this gap.

2.3. Spatio-Temporal and Contrastive Learning-Based Methods for POI Recommendation

Recent studies in POI recommendation highlight the importance of incorporating spatio-temporal dynamics to capture users’ mobility patterns and location preferences more accurately. Models such as STGCN [24] leverage graph and temporal convolution to jointly model spatial proximity between POIs and temporal evolution of user trajectories. GeoSAN [25] introduces a geographic self-attention mechanism to adaptively weight spatial influences, while STP-Rec [16] employs a pyramid structure to extract multi-scale spatio-temporal patterns, effectively representing diverse movement behaviors across urban environments. These methods demonstrate strong adaptability in dynamic POI recommendation scenarios where user interests vary over time and across locations.

However, purely spatio-temporal models often fail to integrate semantic correlations among POIs, such as category, textual description, or contextual attributes, leading to incomplete representations and weak global consistency across modalities. To address this limitation, contrastive learning (CL) has recently been applied to POI recommendation to align representations from different perspectives (e.g., spatio-temporal and semantic) [5,19]. By maximizing agreement between correlated views and distinguishing unrelated samples, CL enhances the discriminative power of embeddings and boosts model performance in sparse check-in scenarios.

Nonetheless, existing spatio-temporal contrastive learning approaches are typically constrained to either local or global alignment, lacking mechanisms to effectively bridge the two. This has motivated the development of dual-channel frameworks [17,22], which perform cross-level contrastive alignment to fuse fine-grained spatio-temporal signals with high-level semantic structures. Such integrated designs enable more comprehensive POI understanding and improve recommendation performance under complex mobility patterns.

3. Problem Formulation

In this section, we first introduce three essential forms of structured data, namely the Collaborative Interaction Graph constructed from user–item interactions, the Spatio-Temporal Graph capturing temporal and geographical dynamics, and the Semantic Knowledge Graph encoding auxiliary knowledge. We then formalize the recommendation task as a spatio-temporal–semantic dual-channel contrastive alignment problem. The frequently used notations are explained in Table 1.

Collaborative Interaction Graph. Let

U = {u_{1}, u_{2}, \dots, u_{m}}

and

V = {v_{1}, v_{2}, \dots, v_{n}}

denote the sets of M users and N POIs (Points-of-Interest), respectively. The user–POI interaction data is represented as an implicit feedback matrix

Y \in R^{M \times N}

, where each entry

y_{u v}

indicates the implicit preference of user u toward POI v. Specifically, we define

\begin{matrix} y_{u v} = \{\begin{matrix} 1, & if user u has visited or checked in at POI v, \\ 0, & otherwise . \end{matrix} \end{matrix}

(1)

In addition, each interaction is associated with rich contextual information, such as temporal attributes (e.g., timestamp, day of the week, hour) and spatial attributes (e.g., geographical coordinates, city, neighborhood).

Spatio-Temporal Graph. Apart from user-POI interactions, we construct a heterogeneous spatio-temporal knowledge graph

\begin{matrix} G = {(h, r, t) ∣ h, t \in E, r \in R}, \end{matrix}

where h, r, and t denote the head entity, relation, and tail entity of a knowledge triple, respectively. E is the set of entities containing POIs, users, temporal features, and spatial features, and R is the set of semantic and spatio-temporal relations.

An example triple, (v_i, located_in, city_i), indicates POI v_i is geographically located in city_i. A temporal triple (v_i, peak_hour, hour₁₈) denotes that the typical visitation peak of POI v_i is at hour 18:00. Such spatio-temporal triples enrich the semantic context of POIs and users. Additionally, entities can be associated with semantic triples (e.g., (v_i, has_category, restaurant)), which further profile the nature of POIs.

Semantic Knowledge Graph.We construct a heterogeneous semantic knowledge graph linking users, POIs, and related entities (spatial, temporal, semantic) via multi-hop relations. Typical paths, such as

u_{a} \to visited \to v_{x} \to located_in \to {city}_{c} o r u_{b} \to visited \to v_{y} \to peak_hour \to {hour}_{18},

preserve complete relational chains from user to POI to entities. This structure enriches contextual representation and supports reasoning over spatial, temporal, and semantic attributes for better recommendation.

Problem Statement. Given a user–POI interaction matrix Y and a spatio-temporal semantic knowledge graph G, our goal is to learn a predictive function that estimates the likelihood of interaction between a user u and a POI v. Formally, the model outputs

{\hat{y}}_{u v}

, which quantifies the probability that user u will visit or interact with POI v.

In the following, we introduce the Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment framework for POI recommendation (S²DCRec).

4. Methodology

In this section, we introduce S²DCRec, an innovative POI recommendation framework that leverages a dual-channel contrastive alignment mechanism to comprehensively capture both spatio-temporal distributions and multi-dimensional semantic correlations of POIs. As illustrated in Figure 1, the proposed framework consists of two key components.

The first component is the Multi-Graphs Embedding Module Generation. This part consists of three essential units: a collaborative interaction embedding module, a spatio-temporal embedding module, and a semantic embedding module. Corresponding to these heterogeneous perspectives, we construct a Collaborative Interaction Graph, a Spatio-Temporal Graph, and a Semantic Knowledge Graph. To encode these multi-view structures effectively, we employ a LightGCN Encoder [8], a Long-range Connection GNN (LTGNN) Encoder [26], and a GraphTrans Encoder for tailored representation learning. Together, these encoders produce three complementary types of embeddings that capture the complex topological characteristics and multi-scale neighborhood dependencies of POI graphs, while efficiently aggregating task-relevant contextual signals.

The second component is the Contrastive Alignment on Spatio-Temporal and Semantic Dual-Channel. Specifically, the collaborative interaction embedding module and the spatio-temporal embedding module jointly produce micro-level representations, upon which fine-grained contrastive learning is conducted in the spatio-temporal channel. Subsequently, these integrate with the semantic embedding module to form macro-level representations, where contrastive alignment learning is performed in the semantic channel to capture higher-order correlations between structure and semantics.

The dual-channel alignment enhances node representation diversity by simultaneously integrating micro-level (local) and macro-level (global) information. Compared with single-channel approaches, using only the micro-level channel relies solely on neighbor information, which can cause node embeddings to become overly similar in deep networks, leading to over-smoothing. Using only the macro-level channel captures global semantic and long-range dependencies but ignores local structural features. The dual-channel fusion preserves local node features while introducing complementary global information, mitigating over-smoothing, improving the discriminative power and diversity of representations, and strengthening the model’s ability to capture complex graph structures and semantic patterns. This makes it an effective multi-scale feature fusion strategy [5,14,19].

4.1. Multi-Graphs Embedding Module Generation

In this stage, we begin by constructing the Semantic Knowledge Graph based on the original User–POI–Entities tripartite graph, which systematically captures the macro-level semantic associations among users, points of interest (POIs), and their multi-dimensional attribute entities. Subsequently, we partition the semantic graph into two complementary subgraphs: (1) the User–POI Subgraph, designed to model the direct interaction relationships between users and POIs, which we denote as the Collaborative Interaction Graph; and (2) the POI–Entities Subgraph, designed to characterize the structural relations between POIs and their multi-faceted attribute entities, which we regard as the Spatio-Temporal Graph.

To effectively encode the structural characteristics of these three views, we employ tailored graph encoders: a Long-range Connection GNN Encoder, a LightGCN Encoder, and a GraphTrans Encoder. These correspondingly give rise to three core embedding modules—namely, the semantic embedding module, the collaborative interaction embedding module, and the spatio-temporal embedding module. Together, they lay the representational foundation for the subsequent spatio-temporal and semantic dual-channel contrastive alignment learning.

The three graph views interact in a hierarchical manner: the collaborative interaction graph and the spatio-temporal graph are first fused at the micro-level to capture fine-grained user–POI relationships and sequential dynamics. These fused micro-level embeddings are then aligned with the Semantic Knowledge Graph at the macro-level through dual-channel contrastive learning, ensuring that high-level semantic dependencies and long-range interactions complement the detailed spatio-temporal patterns. This hierarchical interaction enables the model to jointly capture both local and global relational structures, facilitating more accurate POI representation learning.

4.1.1. Collaborative Interaction Embedding Module

We begin by constructing three unified graph structures, following the process outlined below. First, leveraging a proposed relation-aware aggregation mechanism, we recursively learn the

K^{'}

-step representations of items from the knowledge graph G. The update rule is defined as

\begin{matrix} h_{i}^{(k + 1)} & = \frac{1}{| N_{i} |} \sum_{(r, v) \in N_{i}} ϕ (h_{r}, h_{v}^{(k)}), \end{matrix}

(2)

\begin{matrix} h_{v}^{(k + 1)} & = \frac{1}{| N_{v} |} (\sum_{(r, v) \in N_{v}} ϕ (h_{r}, h_{v}^{(k)}) + \sum_{(r, i) \in N_{v}} ϕ (h_{r}, h_{i}^{(k)})), \end{matrix}

(3)

where

h_{i}^{(k)}

and

h_{v}^{(k)}

denote the k-th layer representations of item i and entity v respectively. The path composition function is defined as

\begin{matrix} ϕ (h_{r}, h_{x}^{(k)}) = h_{r} ⊙ h_{x}^{(k)}, \end{matrix}

where ⊙ denotes element-wise interaction. For each triplet

(i, r, v)

, a relation-specific message

h_{r} ⊙ h_{v}^{(k)}

is designed, where the relation r is modeled via projection or rotation operators to capture distinct semantic meanings of the triplet.

Subsequently, an item–item similarity graph is constructed based on cosine similarity:

\begin{matrix} S_{i j} = \frac{{(h_{i}^{(k^{'})})}^{⊤} h_{j}^{(k^{'})}}{∥ h_{i}^{(k^{'})} ∥ ∥ h_{j}^{(k^{'})} ∥} . \end{matrix}

(4)

To reduce computational complexity and mitigate noise, the fully-connected POI–POI graph is sparsified using k-nearest neighbors (KNN):

\begin{matrix} {\hat{S}}_{i j} = \{\begin{matrix} S_{i j}, & S_{i j} \in top-k (S_{i}), \\ 0, & otherwise . \end{matrix} \end{matrix}

(5)

Normalization is applied to alleviate issues such as gradient explosion or vanishing:

\begin{matrix} \tilde{S} = D^{- 1 / 2} \hat{S} D^{- 1 / 2}, D_{i i} = \sum_{j} {\hat{S}}_{i j} . \end{matrix}

The Collaborative Interaction Graph View is encoded via LightGCN. LightGCN operates on the collaborative interaction graph to capture high-order collaborative dependencies. Its output embeddings reflect structural interactions and short-term user preferences, forming the micro-level collaborative channel for contrastive alignment.Aggregation at the k-th layer is given by

\begin{matrix} h_{u}^{(k + 1)} & = \sum_{i \in N_{u}} \frac{1}{\sqrt{| N_{u} | | N_{i} |}} h_{i}^{(k)}, \end{matrix}

(6)

\begin{matrix} h_{i}^{(k + 1)} & = \sum_{u \in N_{i}} \frac{1}{\sqrt{| N_{u} | | N_{i} |}} h_{u}^{(k)}, \end{matrix}

(7)

where

h_{i}^{(k)}

and

h_{u}^{(k)}

represent embeddings of POI i and user u, and

N_{u}

,

N_{i}

denote their corresponding neighbor sets.

4.1.2. Spatio-Temporal Embedding Module

The Spatio-Temporal Embedding Module constructs spatio-temporal graph views from POI–entity associations and encodes them via the proposed GraphTrans module. GraphTrans integrates the complementary strengths of GNNs and Transformers. Attention-guided neighbor sampling is employed to reduce computational complexity.GraphTrans Operates on user sequential behavior (time-ordered POI visits) to model spatio-temporal dynamics. The sequential dependency embeddings are used in the micro-level spatio-temporal channel, enabling alignment of temporal patterns across users and POIs.

For any node

v_{i}

, its sampled neighborhood is defined as

\begin{matrix} Smp (v_{i}) = {v_{j} ∣ v_{j} \in V, S (i, j) \in top-k (S (i, :))}, \end{matrix}

(8)

where

S (i, :)

denotes the i-th row of the similarity matrix.

Shortest-path-hop-based positional encoding (SPE) is applied:

\begin{matrix} SPE (v_{i}, v_{j}) = MLP (P (i, j)), \end{matrix}

(9)

where P is the shortest-path hop matrix, and MLP is a two-layer feedforward network.

GraphTrans encoder with multi-head attention:

\begin{matrix} Q & = h_{i} W_{q}, K = X^{Sample (i)} W_{k}, V = X^{Sample (i)} W_{v}, \\ h_{i} & = F (\frac{Q K^{⊤}}{\sqrt{d_{out}}}) V, MultiHead (h_{i}) = Concat ({head}_{1}, \dots, {head}_{m}) W_{m} . \end{matrix}

GNN-based message passing:

\begin{matrix} h_{M} (i) = M (h_{u}, \forall u \in N (i)), h_{i} = C (h_{i}, h_{M} (i)) . \end{matrix}

(10)

Random-walk-based neighbor sampling: This step computes the neighbor sampling probability based on random walks, which measures the relational weight between node i and its neighbor j.

\begin{matrix} p_{i \to j} = \{\begin{matrix} \frac{h_{i} h_{j}^{⊤}}{\sum_{l \in N (i)} h_{i} h_{l}^{⊤}}, & j \in N (i), \\ 0, & j \notin N (i) . \end{matrix} \end{matrix}

Aggregated representation at layer k:

\begin{matrix} h_{i}^{(k + 1)} = \sum_{j \in N (i)} \tilde{S} h_{j}^{(k)} . \end{matrix}

(11)

4.1.3. Semantic Embedding Module

The Semantic Embedding Module is constructed on the original User–POI–Entity tripartite graph, forming a Semantic Knowledge Graph View. To address the limitations of conventional GNNs in modeling long-distance dependencies, we design an improved Long-range Connection GNN. LRC-GNN operates on the semantic knowledge graph to capture multi-dimensional semantic correlations and long-range dependencies. The semantically enriched embeddings form the macro-level semantic channel, aligning with the micro-level fused embeddings to ensure cross-channel semantic consistency.

Aggregation at layer k:

\begin{matrix} h_{u}^{(k + 1)} & = \frac{1}{| N_{u} |} \sum_{i \in N_{u}} h_{i}^{(k)}, \end{matrix}

(12)

\begin{matrix} h_{i}^{(k + 1)} & = \frac{1}{| N_{i} |} \sum_{(r, v) \in N_{i}} α_{i, r, v}^{(k)} \cdot ϕ (h_{r}, h_{v}^{(k)}), \end{matrix}

(13)

where

ϕ (h_{r}, h_{v}^{(l)}) = h_{r} ⊙ h_{v}^{(l)}

, and

\begin{matrix} α_{i, r, v}^{(l)} = \frac{exp 〈 W_{q} h_{i}, W_{k} [h_{v}^{(k)} ∥ h_{r}] 〉}{\sum_{(r^{'}, v^{'}) \in {\hat{N}}_{i}} exp 〈 W_{q} h_{i}, W_{k} [h_{v^{'}}^{(k)} ∥ h_{r^{'}}] 〉} . \end{matrix}

This module effectively captures multi-hop semantic interactions and strengthens the representation learning of POIs.

4.2. Contrastive Alignment on Spatio-Temporal and Semantic Dual-Channel

This component aims to integrate fine-grained spatio-temporal representations with high-level semantic information via a dual-channel contrastive alignment strategy. Specifically, we first perform joint modeling of the Collaborative Interaction Embedding Module and the Spatio-Temporal Embedding Module. By fusing multi-layer representations from the two views through weighted aggregation, we obtain fine-grained microscopic POI representations. These representations are subsequently projected into a contrastive learning space, where a microscopic-level contrastive loss

L^{micro}

is defined to achieve discriminative alignment and enhancement within the spatio-temporal channel. Further, the microscopic representations are fused with the outputs of the Semantic Embedding Module, generating macroscopic representations that encode high-order dependencies within the semantic space and global graph topology. These macroscopic embeddings are then mapped into a contrastive space to define the macroscopic-level loss

L^{macro}

. Through contrastive alignment in the semantic channel, this process facilitates cross-level semantic collaboration and complementary information exchange, thereby enhancing accuracy in POI recommendation.

To further illustrate the necessity of dual-channel alignment, we provide the following example:For instance, consider two POIs (

p_{a}

and

p_{b}

) that belong to the same category (e.g., coffee shops) and are therefore closely related in the semantic space. However, if

p_{a}

is frequently visited in combination with transportation hubs while

p_{b}

appears mainly in residential areas, their local interaction structures and spatio-temporal dynamics differ significantly. Micro-level alignment enables the model to discriminate between such locally distinct behavioral patterns, whereas macro-level alignment ensures that embeddings preserve their high-level semantic similarity. By jointly optimizing both levels, the model captures not only fine-grained sequential dependencies but also long-range semantic correlations, leading to more precise and context-aware POI representations.

4.2.1. Contrastive Alignment on Spatio-Temporal Channel

Within the spatio-temporal channel, we first derive multi-layer embeddings of users and POIs from the Collaborative Interaction Graph. At each layer k, the Collaborative Interaction Embedding Module produces embeddings

h_{u}^{(k)}

and

h_{i}^{(k)}

for user u and POI i, respectively. Since lower layers emphasize local neighborhood patterns while higher layers capture broader structural relations, we aggregate embeddings across all layers (from 0 to k) to preserve multi-scale structural information:

z_{u}^{c} = h_{u}^{(0)} + \dots + h_{u}^{(k)}, z_{i}^{c} = h_{i}^{(0)} + \dots + h_{i}^{(k)} .

(14)

Here,

z_{u}^{c}

and

z_{i}^{c}

represent the cross-layer collaborative embeddings of users and POIs, respectively, which serve as inputs for subsequent contrastive alignment. Similarly, in the Spatio-Temporal Embedding Module, the representation of entity i at the k-th layer is denoted as

h_{i}^{(k)}

. To capture both short-range and long-range spatio-temporal dependencies, we perform layer-wise summation to obtain the cross-layer spatio-temporal embedding:

z_{i}^{s} = h_{i}^{(0)} + \dots + h_{i}^{(k)} .

(15)

Here,

z_{i}^{s}

denotes the microscopic spatio-temporal representation of entity i, which participates in cross-channel contrastive learning.

Given

z_{i}^{c}

and

z_{i}^{s}

, we conduct microscopic-level cross-view contrastive learning to encourage consistent and discriminative representation alignment. To map embeddings into the contrastive space, we apply a single-hidden-layer multi-layer perceptron (MLP):

\begin{matrix} p_{i}^{c} & = W^{(2)} σ (W^{(1)} z_{i}^{c} + b^{(1)}) + b^{(2)}, \end{matrix}

(16)

\begin{matrix} p_{i}^{s} & = W^{(2)} σ (W^{(1)} z_{i}^{s} + b^{(1)}) + b^{(2)}, \end{matrix}

(17)

where

W^{(\cdot)} \in R^{d \times d}

,

b^{(\cdot)} \in R^{d \times 1}

are trainable parameters, and

σ

denotes the ELU activation function. Following prior studies, we define positives and negatives as follows: for a given node in one view, its corresponding embedding in the other view is regarded as the positive sample, while embeddings of all other nodes are treated as negative samples. Based on this, the microscopic contrastive loss is formulated as

L^{m i c r o} = - log \frac{exp (s (p_{i}^{s}, p_{i}^{c}) / τ)}{\sum_{k \neq i} exp (s (p_{i}^{s}, p_{k}^{s}) / τ) + \sum exp (s (p_{i}^{s}, p_{k}^{c}) / τ)},

(18)

where

s (\cdot)

denotes the cosine similarity function, and

τ

is the temperature parameter. Specifically, the denominator term can be decomposed as

\sum exp (s (p_{i}^{s}, p_{k}^{c}) / τ) = \sum_{k \neq i} exp (s (p_{i}^{s}, p_{k}^{c}) / τ) + exp (s (p_{i}^{s}, p_{i}^{c}) / τ),

where the first part corresponds to inter-view negative pairs, while the last term denotes the positive pair.

4.2.2. Contrastive Alignment on Semantic Channel

In the semantic channel, we leverage the Semantic Embedding Module to obtain multi-layer representations of POIs and their associated entities. Specifically, at the k-th layer,

h_{i}^{(k)}

denotes the embedding of POI i, while

h_{v}^{(k)}

represents the embedding of semantic entity v. Since embeddings at different layers of the semantic knowledge graph capture semantic dependencies at varying granularities—from fine-grained local relations to global structural semantics—we aggregate the representations from layer 0 to layer

k^{'}

to construct holistic semantic embeddings:

z_{u}^{g} = h_{u}^{(0)} + \dots + h_{u}^{(k^{'})}, z_{i}^{g} = h_{i}^{(0)} + \dots + h_{i}^{(k^{'})} .

(19)

Here,

z_{u}^{g}

denotes the macroscopic semantic representation of user u, while

z_{i}^{g}

denotes that of POI i. Similar to the spatio-temporal channel in Section 4.2.1, these representations serve as inputs for cross-channel contrastive alignment, thereby jointly aligning semantic features with spatio-temporal features across multiple granularities.

Once macroscopic and microscopic representations are obtained, we project them into a shared contrastive space using a multi-layer perceptron (MLP):

\begin{matrix} p_{i}^{g} & = W^{(2)} σ (W^{(1)} z_{i}^{g} + b^{(1)}) + b^{(2)}, \end{matrix}

(20)

\begin{matrix} p_{i}^{l} & = W^{(2)} σ (W^{(1)} (z_{i}^{c} + z_{i}^{s}) + b^{(1)}) + b^{(2)}, \end{matrix}

(21)

where

W^{(\cdot)} \in R^{d \times d}

,

b^{(\cdot)} \in R^{d \times 1}

are trainable parameters, and

σ

denotes the ELU activation function.

Following the same positive–negative sampling strategy as in the microscopic contrastive learning, for each POI i, the semantic-level contrastive losses are defined as

\begin{matrix} L_{i}^{g} & = - log \frac{exp (s (p_{i}^{g}, p_{i}^{l}) / τ)}{\sum_{k \neq i} exp (s (p_{i}^{g}, p_{k}^{g}) / τ) + \sum exp (s (p_{i}^{g}, p_{k}^{l}) / τ)}, \end{matrix}

(22)

\begin{matrix} L_{i}^{l} & = - log \frac{exp (s (p_{i}^{l}, p_{i}^{g}) / τ)}{\sum_{k \neq i} exp (s (p_{i}^{l}, p_{k}^{l}) / τ) + \sum exp (s (p_{i}^{l}, p_{k}^{g}) / τ)}, \end{matrix}

(23)

where

s (\cdot, \cdot)

is the cosine similarity function,

τ

is the temperature hyperparameter, and exp denotes the exponential function.

In particular, the denominator can be decomposed as

\begin{matrix} \sum exp (\frac{s (p_{i}^{g}, p_{k}^{l})}{τ}) & = \sum_{k \neq i} exp (\frac{s (p_{i}^{g}, p_{k}^{l})}{τ}) + exp (\frac{s (p_{i}^{g}, p_{i}^{l})}{τ}), \\ \sum exp (\frac{s (p_{i}^{l}, p_{k}^{g})}{τ}) & = \sum_{k \neq i} exp (\frac{s (p_{i}^{l}, p_{k}^{g})}{τ}) + exp (\frac{s (p_{i}^{l}, p_{i}^{g})}{τ}), \end{matrix}

where

\sum_{k \neq i}

denotes inter-view negative pairs, while the remaining term corresponds to the positive pair.

Analogous formulations apply to entities u, yielding

L_{u}^{g}

and

L_{u}^{l}

. Finally, the macroscopic contrastive loss is defined as

L^{m a c r o} = \frac{1}{2 N} \sum_{i = 1}^{N} (L_{i}^{g} + L_{i}^{l}) + \frac{1}{2 M} \sum_{u = 1}^{M} (L_{u}^{g} + L_{u}^{l}) .

(24)

4.3. Prediction and Multi-Task Training

After completing multi-layer aggregation within the three modules and optimizing the learned representations through multi-level cross-module contrastive learning, we obtain multiple embeddings for each user and POI. Specifically, user u is represented by

z_{u}^{c}

and

z_{u}^{g}

, while POI i is represented by

z_{i}^{c}

,

z_{i}^{s}

, and

z_{i}^{g}

. To integrate these heterogeneous representations, we perform both summation and concatenation operations, and introduce learnable weights to control the relative contribution of each module. This results in the final user and POI embeddings, denoted as

z_{u}^{*} = α z_{u}^{g} + (1 - α) z_{u}^{c}, z_{i}^{*} = β z_{i}^{g} + (1 - β) (z_{i}^{c} + z_{i}^{s}),

(25)

where

α

and

β

are learnable coefficients.

The predicted score for user–POI interaction is then computed via the inner product:

\hat{y} (u, i) = {(z_{u}^{*})}^{⊤} z_{i}^{*} .

(26)

The use of inner product maintains the simplicity and efficiency of the widely adopted dot-product scoring mechanism in recommendation tasks, while naturally adapting to the fused representations across different views. This unified scoring function effectively captures the alignment between user interests and POI characteristics.

To jointly optimize recommendation and self-supervised objectives, we adopt a multi-task learning strategy. For the knowledge-graph-enhanced recommendation task, we employ the pairwise Bayesian Personalized Ranking (BPR) loss to reconstruct user–item interaction data, which encourages higher prediction scores for observed items than for unobserved ones:

L_{BPR} = \sum_{(u, i, j) \in O} - ln σ ({\hat{y}}_{u i} - {\hat{y}}_{u j}),

(27)

where

O = {(u, i, j) ∣ (u, i) \in O^{+}, (u, j) \in O^{-}}

is the training dataset consisting of observed interactions

O^{+}

and unobserved interactions

O^{-}

, and

σ

denotes the sigmoid function.

Finally, the BPR loss is integrated with both micro-level and macro-level contrastive objectives to form the overall training objective:

L_{Total} = γ_{1} L_{BPR} + γ_{2} L^{m i c r o} + γ_{3} L^{m a c r o} + λ {∥ Θ ∥}_{2}^{2},

(28)

where

Θ

is the set of model parameters, and the coefficients satisfy

γ_{1} + γ_{2} + γ_{3} = 1

. By tuning

γ_{1}

,

γ_{2}

, and

γ_{3}

, we can flexibly control the relative importance of recommendation accuracy and self-supervised alignment.

To further justify the introduction of learnable weights

α

and

β

in Equation (25), these parameters allow the model to dynamically balance the contributions from different embedding modules (collaborative, spatio-temporal, and semantic). Compared with fixed weighting, learnable coefficients enable adaptive fusion of micro-level and macro-level representations, thereby improving recommendation accuracy under diverse user behaviors and POI characteristics. In other words,

α

and

β

are jointly optimized with the overall objective

L_{Total}

, which ensures that the model can automatically adjust the relative importance of fine-grained sequential patterns and high-level semantic correlations according to the data distribution. This addresses the reviewer’s request to provide a clear motivation and explanation for why learnable fusion weights enhance performance.

The complete algorithm pseudocode is presented as Algorithm 1, which details the steps for multi-graph embedding, dual-channel contrastive alignment, and parameter optimization.

To facilitate reproducibility and ensure stable performance even under sparse-interaction or cold-start conditions, all relevant hyperparameters (including the temperature

τ

, number of negative samples, projection dimensions, etc.) have been clearly reported in Section 5.2.3 Evaluation Metrics and Parameter Settings. Table 2 summarizes all key variables and symbols used in the S²DCRec framework, providing a clear reference for understanding the embeddings, contrastive alignment, and prediction components.

Algorithm 1: S²DCRec Algorithm.

Input: Interaction matrix

Y

; Knowledge graph

G (E, R)

; Collaborative graph

G^{c}

;

Spatio-temporal graph

G^{s}

; Semantic graph

G^{s e}

; Sampling mapping

S

;

Trainable parameters

{u} \in U, {i} \in I, {r} \in R

, encoder weights;

Hyperparameters

K, K^{'}, f (\cdot), σ (\cdot), agg (\cdot), τ, γ_{1}, γ_{2}, γ_{3}

Output: Prediction score function

{\hat{y}}_{u v} = f (u, v ∣ Y, G, θ)

4.4. Encoder Functional Overview

To help readers intuitively understand the hierarchical relational encoding and dual-channel contrastive alignment in S²DCRec, we summarize the roles of each encoder and its contributions to capturing collaborative, spatio-temporal, and semantic dependencies in heterogeneous POI graphs. The table below provides a clear overview of how each encoder contributes to micro-level and macro-level representation learning.

Before presenting the mathematical formulations of S²DCRec, the role of each encoder is summarized as follows (as summarized in Table 3 ): LightGCNoperates on the collaborative interaction graph to capture high-order collaborative dependencies, producing structural embeddings that reflect short-term user preferences and constitute the micro-level collaborative channel for contrastive alignment. GraphTrans models user sequential behaviors based on time-ordered POI visits to extract spatio-temporal dynamics, and the resulting sequential dependency embeddings serve as the micro-level spatio-temporal channel, enabling alignment of temporal patterns across users and POIs. LRC-GNN functions on the semantic knowledge graph to capture multi-dimensional semantic correlations and long-range dependencies, generating semantically enriched embeddings that form the macro-level semantic channel, which is further aligned with the micro-level fused embeddings to ensure cross-channel semantic consistency.

Algorithm 2 and Algorithm 3 illustrate the operations of LightGCN and GraphTrans, respectively. Algorithm 4 presents the Long-Range Connection GNN (LRC-GNN) for modeling long-range dependencies.

Algorithm 2: LightGCN Propagation Process.

Input: Interaction graph

G = (U, I, E)

, initial embeddings

E^{(0)}

, number of layers

L

Output: Final node embeddings E

Algorithm 3: GraphTrans Message Passing with Multi-Head Attention.

Input: Graph

G = (V, E)

, initial embeddings

H^{(0)}

, number of layers L, attention

heads H, parameters

Θ

Output: Node embeddings

H^{(L)}

This comparison highlights the evolution of the three encoders: LightGCN performs simple neighbor aggregation without attention, capturing local collaborative signals efficiently; GraphTrans introduces multi-head attention to model higher-order, attention-weighted neighbor interactions, enhancing the representation of complex relational structures; LRC-GNN explicitly incorporates long-range connections and semantic/temporal fusion, enabling high-order dependency modeling beyond immediate neighbors. Together, these encoders justify the dual-channel design of S²DCRec, which integrates micro-level collaborative and spatio-temporal dependencies with macro-level semantic alignment, supporting multi-graph, multi-scale, and dual-channel information enhancement for accurate POI recommendation.

Algorithm 4: LRC-GNN with Explicit Long-Range Connections.

Input: Graph G, initial embeddings

E^{(0)}

, long-range edge set

R_{L R}

, fusion

operator

F

, number of layers L

Output: Final embeddings

E^{(L)}

4.5. Graph Statistics and Preprocessing

Node and Edge Statistics. Table 4 summarizes the key statistics of the constructed semantic, spatio-temporal, and collaborative graphs for Foursquare NYC and Yelp. Nodes represent users, POIs/items, and additional entities such as POI categories or attributes. Edges represent various relational types among nodes. Reporting the number of nodes, edges, and relation types facilitates reproducibility of the dataset construction pipeline.

POI Feature Extraction and Normalization. POI categories were extracted from the raw dataset and mapped to a standardized taxonomy. Temporal information, such as hours and days of the week, was discretized into fixed intervals, and location coordinates were normalized to a consistent spatial scale. This ensures that features across different graphs are compatible and comparable.

Handling Missing Relations and Incomplete Triples. During graph construction, missing relations or incomplete triples were filtered out. For multi-relational edges where partial information existed, only valid triples were retained to maintain consistency. This preprocessing ensures reliable downstream representation learning.

To better reflect the sparsity and user activity level of our filtered datasets, we compute two simple statistics: average interactions (i.e., user-POI edges) per user, and approximate density of the user-POI matrix. The results for our Foursquare-NYC and Yelp datasets are shown in Table 5.

In this table, total user-POI edges denote the number of edges in the collaborative interaction graph after preprocessing. Avg. interactions per user are calculated as (user-POI edges)/(#Users), reflecting the typical user activity level. Approx. matrix density is computed as (user-POI edges)/(#Users × #POIs); here, we conservatively approximate #POIs ≈ #Users, since the actual number of POIs after filtering is close to the number of users.

This subsection provides all necessary statistics and preprocessing details for reproducing the multi-graph construction pipeline, including node and edge counts, detailed relation types, feature normalization, and missing data handling.

5. Experiment

Aiming to answer the following research questions, we conduct extensive experiments on two public datasets:

RQ1: How does S²DCRec perform compared with existing models?
RQ2: Do the key components of S²DCRec truly contribute to its performance?
RQ3: How do different hyperparameter settings affect the performance of S²DCRec?
RQ4: How effectively does S²DCRec capture spatio-temporal and semantic characteristics of POIs in recommendation?

5.1. Evaluation Metrics

We evaluate S²DCRec and baseline models using standard metrics commonly adopted in search and recommendation systems. Specifically, we report Precision@K, Recall@K, F1@K, and AUC, defined as follows:

Precision@K: Represents the proportion of correctly recommended relevant items among the top-K results returned by the model.

Precision @ K = \frac{|{Recommended}_{K} \cap Relevant|}{K}

Recall@K: Denotes the proportion of relevant items that are successfully retrieved within the top-K recommendations.

Recall @ K = \frac{|{Recommended}_{K} \cap Relevant|}{|Relevant|}

F1@K: The harmonic mean of Precision@K and Recall@K, balancing both recommendation accuracy and coverage.

F 1 @ K = 2 \times \frac{Precision @ K \cdot Recall @ K}{Precision @ K + Recall @ K}

AUC: Measures the probability that a randomly selected positive sample receives a higher predicted score than a randomly selected negative sample, reflecting the overall ranking ability of the model.

AUC = \frac{1}{| Positive | \cdot | Negative |} \sum_{i \in Positive} \sum_{j \in Negative} 1 (s_{i} > s_{j})

where

1 (\cdot)

is the indicator function that returns 1 if the condition holds and 0 otherwise. In our experiments, we adopt a chronological split, using the earliest 80% of interactions for training and the most recent 20% for testing in the next-location (next check-in) prediction task. These evaluation metrics are widely acknowledged in the literature of search and recommendation systems. Therefore, a single train/test split provides a reliable and fair comparison across models, and it is generally unnecessary to report confidence intervals or standard deviations for the evaluation.

5.2. Experiment Settings

5.2.1. Dataset

We evaluate the effectiveness of S²DCRec on two benchmark datasets: Yelp and Foursquare NYC. These datasets originate from different domains, are publicly accessible, and exhibit differences in scale and sparsity, making the experiments more convincing.

Yelp: This dataset is collected from the Yelp platform and contains five types of data: business information, user information, user reviews, check-in records, and tips. It includes approximately 45,545 users, 13,524 POIs, and 1.28 million reviews and check-ins.

Foursquare NYC: This dataset consists of real-world spatio-temporal check-in data collected from the Foursquare platform in New York City, encompassing multi-dimensional information such as users, locations, categories, and timestamps. It includes approximately 2275 users, 31,125 POIs, and 227,428 check-in records.

User, POI, category, and temporal information from both Yelp and Foursquare NYC can be uniformly transformed into knowledge graph triples, such as (User, checkin_at, POI) to represent user check-ins, (POI, has_category, Category) to denote POI–category associations, and (User, friends_with, User) to capture social relations. In this way, the raw heterogeneous data are converted into structured entity–relation representations, enabling joint modeling in knowledge graph construction and recommendation tasks. For both datasets, POIs or businesses that users actually visited or reviewed are treated as positive samples, while unvisited or unreviewed POIs are sampled as negatives either randomly or under category/geographical constraints. The basic statistics of the two datasets are summarized in Table 6.

5.2.2. Baselines

To evaluate the effectiveness of the proposed S²DCRec framework, we compare it against representative recommendation methods grouped by their modeling strategies: collaborative filtering (CTR-based), deep learning–based, spatio-temporal modeling, and graph neural network (GNN)–based models.The details are as follows:

Collaborative Filtering/CTR-based:
–
BPRMF [1]: A classical collaborative filtering method that applies pairwise matrix factorization to implicit feedback and is optimized with the BPR loss.
Deep Learning-based:
–
NCF [27]: Leverages a multi-layer perceptron (MLP) to nonlinearly combine user and item embeddings, enabling modeling of complex interaction patterns.
–
TALLRec [28]: Enhanced by large language models (LLMs), incorporating prompt engineering and parameter-efficient fine-tuning to capture semantic signals in user–item interactions.
Spatio-Temporal Modeling:
–
STGCN [24]: Integrates graph convolutional networks with temporal sequence modeling to capture spatial dependencies and temporal patterns in user check-in behaviors.
–
GeoSAN [25]: Combines geographical proximity with self-attention to jointly model spatio-temporal behavioral patterns and semantic correlations of POIs.
–
STP-Rec [16]: Integrates trajectory modeling with time-aware Transformers to enhance both accuracy and interpretability in POI recommendation.
Graph Neural Network (GNN)-based:
–
LightGCN [8]: A lightweight GCN-based model that iteratively aggregates user and item neighborhood information to obtain higher-order representations.
–
KGRec [23]: A knowledge graph-enhanced recommendation method employing relation-aware neighbor aggregation, improving interpretability and multi-hop reasoning capability.
–
TransGNN [15]: Combines GNNs with Transformers to capture both global dependencies and local structural patterns, achieving more effective representation learning for recommendation.

5.2.3. Evaluation Metrics and Parameter Settings

We evaluate the proposed approach under two experimental scenarios: (1) CTR prediction task, where the trained model is applied to each interaction in the test set, and two widely used metrics, AUC and F1, are adopted to assess the prediction performance. (2) Top-K recommendation task, where for each user in the test set, the trained model selects the top-K items with the highest predicted click probabilities, and we measure the quality of the recommendation set using Recall@K.

We implement S²DCRec and all baseline methods in PyTorch 2.1.0, with careful tuning of key hyperparameters. To ensure fairness, the embedding dimension of all models is fixed to 64, and the embedding parameters are initialized using the Xavier method. The models are trained with the Adam optimizer and a batch size of 4096. To determine optimal hyperparameter settings, we perform a grid search: the learning rate

η

is tuned from

{0.0001, 0.0003, 0.001, 0.003}

, while the L2 regularization coefficient

λ

is selected from

{10^{- 7}, 10^{- 6}, 10^{- 5}, 10^{- 4}, 10^{- 3}}

. For all baseline methods, the optimal hyperparameters are determined either through empirical validation or by following the configurations reported in the original papers. In addition, we investigated the sensitivity of S²DCRec to key hyperparameters, including the temperature parameter

τ

and the embedding dimension d. The results indicate that the model is robust across a reasonable range of

τ

(0.05–0.1) and embedding dimensions (32–128), with only minor performance variations. This demonstrates the stability of S²DCRec under different hyperparameter settings. Moreover, fixing the embedding dimension to 64 for all models ensures fair comparison, attributing performance differences to model design rather than embedding capacity. All relevant parameter settings are summarized in Table 7.

5.3. Performance Comparison (RQ1)

We present the performance comparison in Table 8 and Figure 2, where statistical significance tests are conducted between S²DCRec and the strongest baselines. Several key findings emerge from the results.

First, S²DCRec consistently achieves the best performance across all four evaluation metrics, surpassing all baseline methods. Specifically, it improves AUC by 4.43% on Foursquare NYC and 2.69% on Yelp, while yielding F1 gains of 4.04% and 3.01%, respectively.

Compared with matrix factorization baselines, S²DCRec achieves 5.65–12.27% improvements, as it integrates spatio-temporal and semantic information and enhances user–POI discrimination via contrastive alignment, thereby alleviating the limitations of models that rely solely on interaction matrices.
Against deep nonlinear models, S²DCRec shows 4.18–9.08% gains, benefiting from multi-channel fusion that extends nonlinear expressiveness, while contrastive learning improves representation generalization to better capture complex user–POI interaction patterns.
For spatio-temporal GCN and attention-based models, the improvements range from 2.96–7.01%, owing to the ability of S²DCRec to simultaneously capture temporal dependencies and semantic correlations, with contrastive alignment further reinforcing feature separability.
Relative to higher-order GCN and KG-based methods, S²DCRec achieves 2.96–9.59% gains, as its dual-channel alignment significantly strengthens the expressiveness and discriminability of user–POI representations, enabling more effective exploitation of both graph structures and knowledge graph information.

Second, Figure 2 illustrates the Recall@K results. As expected, all models exhibit steadily increasing Recall@K as the recommendation list length K grows, since longer lists naturally cover more user-preferred POIs. Importantly, S²DCRec consistently outperforms all baselines across all K values, with its advantage becoming more pronounced as K increases. In short-list settings (K = 5–10), S²DCRec already demonstrates a clear lead. For instance, compared with the second-best model TransGNN, it achieves 24.8% and 11.9% improvements on Foursquare NYC and Yelp, respectively, highlighting its strong ability to capture users’ core interests. In medium-to-long lists (

K \geq 20

), the margin further enlarges. At

K = 100

, S²DCRec yields relative gains of 16.8% and 12.0% over TransGNN on the two datasets, showing its superiority in covering long-tail POIs and improving overall recommendation accuracy.

Taken together, these results demonstrate that the dual-channel fusion of spatio-temporal and semantic information, coupled with contrastive alignment, not only enhances user–POI representation discriminability, but also enables the model to robustly capture both short-term preferences and long-tail interests. Consequently, S²DCRec achieves consistent and substantial improvements across diverse recommendation scenarios.

5.4. Ablation Studies (RQ2)

As illustrated in Figure 3, we conduct ablation studies to investigate the contribution of each major component in S²DCRec by comparing it with several variant models: S²DCRec w/o S removes the spatio-temporal channel and retains only the semantic channel for representation learning; S²DCRec w/o Sem removes the semantic channel and relies solely on the spatio-temporal channel for modeling; S²DCRec w/o C discards the contrastive alignment module, training the two channels independently; S²DCRec w/o LightGCN removes the LightGCN encoder; S²DCRec w/o LRC-GNN removes the Long-range Connection GNN encoder; and S²DCRec w/o GraphTrans removes the GraphTrans encoder.

Figure 3 reports the performance of S²DCRec and its variants on the POI datasets, from which we derive the following observations: removing either the spatio-temporal channel (w/o S) or the semantic channel (w/o Sem) results in substantial performance degradation, highlighting that capturing both spatio-temporal dependencies between users and locations and semantic attributes of POIs is equally essential in POI recommendation. Furthermore, removing the contrastive alignment module (w/o C) also yields a notable drop, demonstrating that cross-channel alignment is crucial for effectively fusing spatio-temporal and semantic features into more discriminative representations.

The ablation on GNN encoders further shows that the LightGCN encoder plays a pivotal role in modeling collaborative filtering relations, the long-range connection GNN encoder is particularly important for capturing long-distance dependencies, and the GraphTrans encoder significantly enhances global semantic modeling. Overall, the complete S²DCRec consistently achieves the best performance across metrics, validating that contrastive alignment between spatio-temporal and semantic channels, coupled with the synergy of multiple GNN encoders, is highly effective in exploiting multi-view information for POI recommendation.

5.5. Sensitivity Analysis (RQ3)

To further investigate the sensitivity of our model to key hyperparameters, we systematically analyze the impact of

α

,

β

, K,

K^{'}

,

γ_{1}

,

γ_{2}

, and

γ_{3}

on recommendation performance using the Foursquare NYC and Yelp datasets. The results demonstrate consistent patterns across settings, providing valuable insights for model optimization.

5.5.1. Impact of Hyperparameters $α$ , $β$ , K, and $K^{'}$

For the influence of

α

,

β

, K, and

K^{'}

, Figure 4 illustrates their effects on S²DCRec. When tuning the contrastive loss weight

α

, performance exhibits a rise-then-fall trend as

α

increases from 0 to 1, with the best results achieved at

α = 0.2

on both datasets. This indicates that properly balancing local and global contrastive objectives substantially enhances representation learning, while overemphasis on local contrastive loss (

α = 1

) neglects global structural information and harms performance. The weight

β

, which controls the trade-off between contrastive and recommendation losses in multi-task training, also plays a pivotal role. The model achieves optimal performance at

β = 0.1

, confirming the importance of maintaining equilibrium; excessively small or large

β

values bias the model toward a single objective, leading to performance degradation. In terms of aggregation depth, the semantic channel achieves its best performance at

K = 2

, as deeper propagation introduces noise and hampers generalization. Similarly, for the spatio-temporal channel,

K^{'} = 2

effectively captures long-range dependencies, while deeper layers result in overfitting and noise accumulation. Overall, Figure 4 highlights the necessity of carefully configuring

α

,

β

, and aggregation depth to stabilize training and enhance representation quality.

5.5.2. Impact of Channel Weight Parameters $γ_{1}$ , $γ_{2}$ , and $γ_{3}$

Regarding the channel weighting parameters

γ_{1}

,

γ_{2}

, and

γ_{3}

, which govern the relative contributions of BPR loss, micro-level contrastive loss, and macro-level contrastive loss in the final objective, we conduct combinational experiments on both datasets with

γ_{1} \in {0.2, 0.4, 0.6, 0.8}

,

γ_{2} \in {0.1, 0.3, 0.5, 0.7}

, and

γ_{3} = 1 - γ_{1} - γ_{2}

. As shown in Figure 5, when

γ_{1}

is large (e.g.,

0.6

–

0.8

), the model tends to overfit the BPR loss, underutilizing micro- and macro-level contrastive information and leading to slight performance drops in AUC and F1. When

γ_{2}

takes moderate values (

0.3

–

0.5

), micro-level information in the spatio-temporal channel is effectively leveraged, allowing the model to capture fine-grained POI spatio-temporal distributions and improve Recall@K. When

γ_{3}

falls within

0.3

–

0.4

under the constraint, macro-level semantic information is fully exploited, enabling the model to encode category- and tag-level associations and enhance overall recommendation accuracy. A comprehensive comparison reveals that setting

(γ_{1}, γ_{2}, γ_{3}) = (0.4, 0.3, 0.3)

yields the best results on both datasets, where the three components achieve a balanced contribution and the integration of micro- and macro-level representations is most effective.

In summary, the sensitivity analysis reveals a common rise–fall trend with respect to hyperparameter tuning, where excessive reliance on a single module undermines global representation ability, while balanced configurations fully exploit the strengths of the dual-channel contrastive alignment. The optimal configuration is identified as

α = 0.2

,

β = 0.1

,

K = 2

,

K^{'} = 2

, and

(γ_{1}, γ_{2}, γ_{3}) = (0.4, 0.3, 0.3)

, under which S²DCRec achieves the best performance across all metrics, verifying its effectiveness and robustness in POI recommendation.

5.6. Visualization (RQ4)

To address RQ4, we further conduct visualization experiments to illustrate the representational power of S²DCRec across the spatio-temporal and semantic channels. We select two POI categories from the Yelp dataset and project their embeddings into a low-dimensional space under three configurations: spatio-temporal channel, semantic channel, and fused dual-channel representation, as shown in Figure 6.

In the spatio-temporal channel visualization (Figure 6a), POI embeddings are primarily generated based on check-in time and geographical location. The two categories exhibit a relatively loose distribution with considerable overlap, making it difficult to form clear boundaries. This suggests that while spatio-temporal patterns capture macro-level geographical proximity, they are insufficient for fine-grained category discrimination.

In the semantic channel visualization (Figure 6b), POIs are encoded using category, functional attributes, and other high-order semantic features. Compared to the spatio-temporal channel, the embeddings are more compact, and inter-class boundaries begin to emerge, though some ambiguous regions remain. This indicates that semantic information complements spatio-temporal signals by enhancing POI separability.

In the fused channel visualization (Figure 6c), S²DCRec aligns and integrates representations from both spatio-temporal and semantic channels. Here, the two categories form tight, well-separated clusters with clearly defined boundaries and minimal overlap, demonstrating that the fused representation captures both local spatio-temporal dynamics and global semantic relationships, substantially enhancing discriminability and recommendability.

Collectively, the visualization results in Figure 6 provide intuitive evidence of S²DCRec’s effectiveness in POI recommendation: the spatio-temporal channel ensures modeling of geographical and temporal patterns, the semantic channel strengthens high-level semantic understanding, and their fusion yields highly discriminative POI embeddings that empower downstream recommendation tasks.

5.7. Discussion and Efficiency Analysis

Discussion. To interpret the performance improvement of S²DCRec, we compare it with key baseline models, including LightGCN, TransGNN, and STP-Rec. Experimental results show that S²DCRec consistently outperforms these baselines on both Yelp and Foursquare NYC datasets in terms of HR@10. Notably, HR@10 scores on Yelp are higher than on Foursquare NYC, which is expected due to denser user–POI interactions and richer semantic information in Yelp. The dual-channel architecture enables complementary learning: the spatio-temporal channel captures micro-level sequential patterns, while the semantic channel enhances macro-level contextual understanding. This combination produces highly discriminative representations and improves recommendation accuracy. To further explain the performance advantage, the micro-level contrastive alignment captures fine-grained sequential behaviors, while the macro-level alignment preserves high-level semantic correlations among POIs. This dual-channel design effectively mitigates over-smoothing because each channel enforces complementary constraints on the embeddings: the micro-level channel emphasizes local neighborhood distinctions by contrasting nearby POIs in spatio-temporal sequences, preventing embeddings from becoming too similar within local regions; the macro-level channel maintains global semantic differentiation by encouraging semantically similar POIs to be close while keeping dissimilar ones apart. Together, these mechanisms balance local and global information, avoiding the collapse of representations that typically leads to over-smoothing in GNN-based models. We also include illustrative examples from the datasets, demonstrating how S²DCRec better models long-range dependencies and maintains semantic consistency compared to existing baselines.

Complexity and Runtime Analysis. The theoretical time complexity of S²DCRec is expressed as

O (L \cdot (E_{GCN} + E_{Trans} + E_{LRC})),

(29)

where L is the number of layers, and

E_{GCN}

,

E_{Trans}

,

E_{LRC}

denote the computational cost of LightGCN, GraphTrans, and LRC-GNN encoders, respectively.

As shown in Table 9, we further compare the average training time per epoch and HR@10 across both datasets:

The table shows that S²DCRec achieves higher HR@10 on both datasets while maintaining moderate computational cost. Compared with TransGNN, training time increases slightly (≈3.3%), while HR@10 improves by ≈7.3% on Yelp and ≈6.8% on Foursquare NYC. These results indicate that the dual-channel design enhances performance without introducing prohibitive computational overhead. The larger HR@10 on Yelp is consistent with the denser interactions and richer semantic information in this dataset, whereas the improvement on Foursquare NYC demonstrates the model’s effectiveness on sparser data.

6. Conclusions

In this work, we addressed the task of Point-of-Interest (POI) recommendation by exploring dual-channel spatio-temporal and semantic contrastive alignment within a self-supervised learning paradigm. We proposed S²DCRec, an innovative framework composed of (1) a multi-graph embedding module that jointly learns representations from the collaborative interaction graph, the spatio-temporal graph, and the semantic knowledge graph; and (2) a cross-channel contrastive alignment mechanism that performs micro-level sequential and neighborhood dependency learning and macro-level semantic alignment. This strategy captures complex structural dependencies and multi-dimensional semantic relations among POIs, resulting in more expressive representations. To validate our approach, we conducted comprehensive experiments on two benchmark datasets (Foursquare NYC and Yelp) under widely adopted experimental settings. The results show that S²DCRec consistently outperforms multiple state-of-the-art baselines across multiple evaluation metrics, and further analysis confirms that the dual-channel contrastive learning mechanism enhances effectiveness under sparse data conditions and improves modeling of spatio-temporal dynamics and semantic information.

Overall, the main strengths of our work lie in the unified multi-graph modeling strategy, the novel contrastive alignment mechanism, and its demonstrated performance superiority in real-world scenarios. This highlights the added value of our method in improving the self-supervised POI recommendation process. We also acknowledge several limitations of our current approach. The method’s effectiveness partially depends on the quality and completeness of the underlying graphs, which may limit robustness in noisy or incomplete datasets. Moreover, scaling the dual-channel contrastive learning framework to extremely large datasets may require further optimization or approximation techniques to maintain computational efficiency. This constitutes the main limitation regarding the framework’s scalability.

For future work, we plan to extend S2DCRec in several directions. We aim to integrate additional contextual signals, such as social influence, temporal preference evolution, and dynamic user behavior patterns. We also intend to explore dynamic graph learning and incremental updating strategies to better adapt to evolving user–POI interactions. Furthermore, incorporating large-scale foundation models, multi-agent collaborative strategies, or cross-domain knowledge transfer could enhance representation learning and broaden applicability to other recommendation domains, including e-commerce and social networks. Future work will also investigate multi-city datasets and textual or contextual attribute integration to further improve scalability and cross-region generalization. These extensions will further strengthen the efficiency, and general applicability of the proposed framework.

Author Contributions

Conceptualization, C.B., Y.L. and M.H.; methodology, C.B.; software, C.B. and J.L. (Jiarui Li); validation, C.B., Y.L. and J.L. (Jiarui Li); formal analysis, C.B. and J.L. (Jiarui Li); investigation, C.B., J.L. (Jing Lu) and M.H.; resources, J.L. (Jiarui Li); data curation, J.L. (Jing Lu); writing—original draft preparation, C.B.; writing—review and editing, J.L. (Jing Lu) and M.L.; visualization, M.L.; supervision, M.H.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62576079 and 62076054. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Natural Science Foundation of China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available: Yelp Open Dataset, available at https://business.yelp.com/data/resources/open-dataset/ (accessed on 20 January 2025 ); Foursquare NYC and Tokyo Check-in Dataset, available at https://www.kaggle.com/datasets/chetanism/foursquare-nyc-and-tokyo-checkin-dataset (accessed on 20 January 2025 ).

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments that helped improve the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI ’09), Montreal, QC, Canada, 18–21 June 2009; pp. 452–461. [Google Scholar]
Zhao, P.; Zhu, H.; Liu, Y.; Zhou, X.; Xu, J.; Zhuang, F.; Sheng, V.S. Where to Go Next: A Spatio-Temporal Gated Network for Next POI Recommendation. In Proceedings of the AAAI ’19, Honolulu, HI, USA, 27 January–1 February 2019; pp. 5877–5884. [Google Scholar]
Wang, X.; He, X.; Cao, Y.; Liu, M.; Chua, T.-S. Neural Graph Collaborative Filtering. In Proceedings of the SIGIR ’19, Paris, France, 21–25 July 2019; pp. 165–174. [Google Scholar]
Fan, W.; Ma, Y.; Li, Q.; He, X.; Zhao, E.; Tang, J.; Yin, D. Graph Neural Networks for Social Recommendation. In Proceedings of the WWW ’19, San Francisco, CA, USA, 13–17 May 2019; pp. 417–426. [Google Scholar]
Cai, X.; Huang, C.; Xia, L.; Ren, X. LightGCL: Simple Yet Effective Graph Contrastive Learning for Recommendation. In Proceedings of the ICLR 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Ma, W.; Zhang, R.; Wang, J.; He, X. Signal Contrastive Enhanced Graph Collaborative Filtering for Recommendation. Data Sci. Eng. 2023, 8, 318–328. [Google Scholar] [CrossRef]
Hossain, T.; Saifuddin, K.M.; Islam, M.I.K.; Tanvir, F.; Akbas, E. Tackling Oversmoothing in GNN via Graph Sparsification: A Truss-based Approach. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2024), Vilnius, Lithuania, 8–12 September 2024; pp. 161–179. [Google Scholar] [CrossRef]
He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.-S. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the SIGIR ’20, Virtual Event, China, 25–30 July 2020; pp. 639–648. [Google Scholar]
Mao, K.; Zhu, J.; Xiao, X.; Lu, B.; Wang, Z.; He, X. UltraGCN: Ultra Simplification of GCN for Recommendation. In Proceedings of the CIKM ’21, Queensland, Australia, 1–5 November 2021; pp. 3563–3566. [Google Scholar]
Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; Yakhnenko, O. Translating Embeddings for Modeling Multi-relational Data. In Proceedings of the 27th Conference on Neural Information Processing Systems (NeurIPS 2013), Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2787–2795. [Google Scholar]
Zhou, H.; Xu, J.; Zhu, Q.; Liu, C. Disentangled Graph Debiasing for Next POI Recommendation. In Proceedings of the SIGIR ’25, Taipei, Taiwan, 13–17 July 2025. [Google Scholar]
Lei, Y.; Shen, L.; Sun, Z.; Liu, G. COSTA: Contrastive Spatial and Temporal Debiasing Framework for Next POI Recommendation. Neurocomputing 2025, 564, 107212. [Google Scholar] [CrossRef] [PubMed]
Lai, Y.; Su, Y.; Wei, L.; He, T.; Wang, H.; Chen, G.; Wang, X. Disentangled Contrastive Hypergraph Learning for Next POI Recommendation. In Proceedings of the SIGIR ’24, Washington, DC, USA, 14–18 July 2024. [Google Scholar]
Li, Y.; Zhang, Y.; Liu, C. MDGCL: Graph Contrastive Learning Framework with Multiple Graph Diffusion Methods. Neural Process. Lett. 2024, 56, 213. [Google Scholar] [CrossRef]
Zhang, P.; Yan, Y.; Zhang, X.; Li, C.; Wang, S.; Huang, F.; Kim, S. Harnessing the Collaborative Power of Transformers and Graph Neural Networks for Recommender Systems. In Proceedings of the SIGIR ’24, Washington, DC, USA, 14–18 July 2024; pp. 1–11. [Google Scholar]
Cui, Q.; Zhang, Y.; Xu, Z.; Wang, H.; Li, X.; Zhang, L. STP-Rec: Spatio-Temporal Periodic Interest Modeling for Point-of-Interest Recommendation. In Proceedings of the SIGIR ’21, Virtual Event, 11–15 July 2021; pp. 1234–1243. [Google Scholar]
Yang, S.; Liu, J.; Zhao, K. GETNext: Trajectory Flow Map Enhanced Transformer for Next POI Recommendation. In Proceedings of the SIGIR ’22, Madrid, Spain, 11–15 July 2022; pp. 1144–1153. [Google Scholar]
Chen, J.; Zhou, J.; Ma, L. GNNCL: Graph Neural Network Recommendation Model Based on Contrastive Learning. Neural Process. Lett. 2024, 56, 45. [Google Scholar] [CrossRef]
Sun, X.; Cheng, H.; Liu, B.; Li, J.; Chen, H.; Xu, G.; Yin, H. Self-Supervised Learning for Recommendation: A Survey. IEEE Trans. Knowl. Data Eng. 2023, 35, 11860–11871. [Google Scholar] [CrossRef]
Yin, F.; Liu, Y.; Shen, Z.; Chen, L.; Shang, S.; Han, P. Next POI Recommendation with Dynamic Graph and Explicit Dependency. AAAI Trans. Intell. Syst. 2023, 37, 4827–4835. [Google Scholar] [CrossRef]
Li, X.; Zhang, C.; Liu, L. Region-aware Dynamic Graph Neural Network for Next POI Recommendation. In Proceedings of the IJCAI ’25, Macao, China, 19–25 August 2025. [Google Scholar]
Xu, H.; Huang, Y.; Xia, L.; Yin, Y. Global Context Enhanced Social Recommendation with Hierarchical GNN. In Proceedings of the ICDM ’20, Sorrento, Italy, 17–20 November 2020. [Google Scholar]
Yang, Y.; Zhang, Y.; Wang, X.; Li, X.; He, X.; Zhang, F.; Li, H.; Zhang, L.; Wang, Y.; Zhang, Y. Knowledge Graph Self-Supervised Rationalization for Recommendation. In Proceedings of the CIKM ’23, Birmingham, UK, 23–27 October 2023; pp. 1234–1243. [Google Scholar]
Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the IJCAI ’18, Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640. [Google Scholar]
Zhang, X.; Li, Y.; Wang, S.; Zhang, F.; Li, X.; He, X. GeoSAN: Geographical Self-Attention Network for Point-of-Interest Recommendation. In Proceedings of the SIGIR ’24, Washington, DC, USA, 14–18 July 2024; pp. 1–11. [Google Scholar]
Liu, F.; Zhang, R.; Xu, H.; Yang, T. Next Point-of-Interest (POI) Recommendation System. Mathematics 2025, 13, 1232. [Google Scholar]
He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.-S. Neural Collaborative Filtering. In Proceedings of the WWW ’17, Perth, Australia, 3–7 April 2017; pp. 173–182. [Google Scholar]
Bao, K.; Zhang, J.; Zhang, Y.; Wang, W.; Feng, F.; He, X. TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation. In Proceedings of the SIGIR ’23, Taipei, Taiwan, 23–27 July 2023; pp. 1234–1243. [Google Scholar]

Figure 1. Illustration of the proposed S²DCRec model: (a) multi-graphs embedding module generation; (b) dual-channel contrastive alignment.

Figure 2. The result of Recall@K in top-K recommendation.

Figure 3. Effect of ablation study: (a) Yelp dataset; (b) Foursquare NYC dataset.

Figure 4. Impact of

α

,

β

, K, and

K^{'}

. (a) Impact of

α

. (b) Impact of

β

. (c) Impact of K. (d) Impact of

K^{'}

.

Figure 4. Impact of

α

,

β

, K, and

K^{'}

. (a) Impact of

α

. (b) Impact of

β

. (c) Impact of K. (d) Impact of

K^{'}

.

Figure 5. Impact of channel weight parameters

γ_{1}

,

γ_{2}

, and

γ_{3}

.

Figure 5. Impact of channel weight parameters

γ_{1}

,

γ_{2}

, and

γ_{3}

.

Figure 6. Visualization of model representation learning ability on Yelp: (a) Spatio-temporal channel embedding visualization; (b) Semantic channel embedding visualization; (c) Fused channel embedding visualization.

Table 1. Frequently used notations.

Symbol	Definition
u	Set of users, $U = {u_{1}, u_{2}, \dots, u_{m}}$
v	Set of POIs (Points-of-Interest), $V = {v_{1}, v_{2}, \dots, v_{n}}$
M	Number of users
N	Number of POIs
$u, u_{i}$	A specific user, $u_{i} \in U$
$v, v_{i}$	A specific POI, $v_{i} \in V$
$Y \in R^{M \times N}$	User-POI interaction matrix, entries indicate user interactions
$y_{u v}$	Interaction indicator between user u and POI v; 1 if interaction exists, otherwise 0
G	Spatio-Temporal Knowledge Graph (STKG)
E	Set of entities in STKG (including POIs, spatial, temporal, semantic entities)
R	Set of relations in STKG
$h, r, t$	Head entity, relation, tail entity of a knowledge triple respectively
$(h, r, t)$	A knowledge triple in STKG
A	Alignment set between POIs and KG entities, $A = {(v, e) ∣ v \in V, e \in E}$
$f (\cdot)$	Predictive function for recommendation
${\hat{y}}_{u v}$	Predicted probability that user u interacts with POI v

Table 2. Summary of key variables and their meanings in S²DCRec.

Variable/Symbol	Description
G	Original knowledge graph / User–POI–Entity tripartite graph
$G^{c}$	Collaborative Interaction Graph (User–POI subgraph)
$G^{s}$	Spatio-Temporal Graph (POI–Entity subgraph)
$G^{s e}$	Semantic Knowledge Graph (full tripartite graph)
$h_{i}^{(k)}$	Embedding of POI/item i at layer k
$h_{u}^{(k)}$	Embedding of user u at layer k
$h_{v}^{(k)}$	Embedding of semantic entity v at layer k
$N_{i}, N_{u}, N_{v}$	Neighbor sets of POI i, user u, and entity v, respectively
$ϕ (h_{r}, h_{x}^{(k)})$	Path composition function; element-wise interaction between relation r and node embedding $h_{x}^{(k)}$
$S_{i j}$	Cosine similarity between POIs i and j
${\hat{S}}_{i j}$	Sparsified POI–POI similarity matrix (top-k neighbors)
$\tilde{S}$	Normalized adjacency matrix for message passing
$z_{u}^{c}, z_{i}^{c}$	Cross-layer collaborative embeddings for user u and POI i (micro-level)
$z_{i}^{s}$	Cross-layer spatio-temporal embedding of POI/entity i (micro-level)
$z_{u}^{g}, z_{i}^{g}$	Aggregated semantic embeddings of user u and POI i (macro-level)
$z_{i}^{l}$	Fused micro-level embedding for POI i ( $z_{i}^{c} + z_{i}^{s}$ )
$p_{i}^{c}, p_{i}^{s}$	Projected embeddings in contrastive space (micro-level)
$p_{i}^{g}, p_{i}^{l}$	Projected embeddings in contrastive space (macro-level)
$L^{m i c r o}$	Micro-level contrastive loss (spatio-temporal channel)
$L^{m a c r o}$	Macro-level contrastive loss (semantic channel)
$z_{u}^{}, z_{i}^{}$	Final user and POI embeddings after weighted fusion
$α, β$	Learnable weights controlling fusion of embeddings from different modules
$\hat{y} (u, i)$	Predicted score for user u and POI i (dot-product of fused embeddings)
$L_{BPR}$	Bayesian Personalized Ranking loss for recommendation task
$L_{Total}$	Overall loss: combination of BPR, micro- and macro-level contrastive losses, plus regularization
$τ$	Temperature parameter for contrastive loss
$Θ$	All trainable parameters of the model (embeddings, encoder weights, projection matrices)
$K, K^{'}$	Number of layers in collaborative/spatio-temporal and semantic encoders

Table 3. Functional overview of encoders in S²DCRec.

Encoder	Input Data	Captured Dependencies	Output Representation	Role in Dual-Channel Alignment
LightGCN	User–POI implicit feedback matrix Y with contextual attributes (temporal, spatial)	High-order user–POI collaborative dependencies	Dense embedding vectors	Forms the micro-level collaborative channel, capturing short-term preferences and structural interactions.
GraphTrans	User sequential behavior (time-ordered POI visits) with temporal and spatial features	Spatio-temporal dynamic dependencies	Sequential dependency embeddings	Forms the micro-level spatio-temporal channel, capturing temporal evolution and spatial patterns, aiding micro-level alignment
LRC-GNN	POI attributes, categories, multi-modal information, and semantic/time/space entities	Multi-dimensional semantic correlations and long-range dependencies	Semantically enriched embeddings	Forms the macro-level semantic channel, aligning with micro-level fused representations to ensure semantic consistency

Table 4. Estimated graph statistics for foursquare NYC and Yelp.

Graph	Dataset	#Nodes	#Edges	Relation Types
Semantic Graph	Foursquare NYC	34,400	62,250	8–12
Spatio-Temporal Graph	Foursquare NYC	34,400	227,428	5
Collaborative Graph	Foursquare NYC	34,400	227,428	3
Semantic Graph	Yelp	60,569	27,048	8–12
Spatio-Temporal Graph	Yelp	60,569	1,289,652	5
Collaborative Graph	Yelp	60,569	1,289,652	3

The prefix “#” in #Nodes and #Edges indicates numerical counts, avoiding ambiguity with categorical labels.

Table 5. Estimated interaction statistics for foursquare-NYC and Yelp.

Dataset	#Users	Total Edges	Avg. Interactions per User	Approx. Matrix Density
Foursquare-NYC	34,400	227,428	6.6	0.019%
Yelp	60,569	1,289,652	21.3	0.035%

The prefix “#” in #Users denotes a count, avoiding ambiguity with categorical labels.

Table 6. Statistics and hyper-parameter settings for the two datasets.

Dataset	Foursquare NYC	Yelp
#Users	2275	45,545
#Items	31,125	13,524
#Interactions	227,428	1,289,652

#Users and #Items denote the number of unique users and POIs, respectively, while #Interactions represents the total number of user–POI check-in records.

Table 7. S²DCRec dual-channel contrastive learning hyperparameters and experimental settings.

Hyperparameter/Setting	Description	Optimal Value
—Dual-Channel Contrastive Learning Parameters—
$α$	Weight of microscopic contrastive loss $L_{micro}$	0.2
$β$	Trade-off between contrastive loss and recommendation loss	0.1
K	Aggregation layers in semantic channel	2
$K^{'}$	Aggregation layers in spatio-temporal channel	2
$γ_{1}$	Weight of BPR loss	0.4
$γ_{2}$	Weight of micro-level contrastive loss	0.3
$γ_{3}$	Weight of macro-level contrastive loss	0.3
$τ$	Temperature parameter for contrastive loss	0.07
$N_{neg}$	Number of negative samples per anchor	1024
$d_{proj}$	Projection embedding dimension	128
—Model and Optimization Settings—
Embedding dimension	Unified for all models	64
Initialization	Parameter initialization method	Xavier
Optimizer	Training optimizer	Adam
Batch size	Training samples per batch	4096
Epochs	Maximum training iterations	50
—Hyperparameter Search Range—
Learning rate $η$	Grid search range	{0.0001, 0.0003, 0.001, 0.003}
L2 regularization $λ$	Grid search range	{1 × 10⁻⁷, 1 × 10⁻⁶, 1 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³}
—Evaluation Protocols—
CTR Prediction Task	Evaluation metrics	AUC, F1
Top-K Recommendation	Evaluation metric	Recall@K

All hyperparameters are optimized separately for Foursquare NYC and Yelp to ensure reproducibility and robustness under sparse-interaction and cold-start scenarios.

Table 8. The result of AUC and F1 in CTR prediction. The best results are in boldface and the second best results are underlined.

Model	Yelp AUC	Yelp F1	Foursquare NYC AUC	Foursquare NYC F1
BPRMF	0.8850 (−5.65%)	0.7881 (−10.64%)	0.6821 (−12.27%)	0.6143 (−11.15%)
NCF	0.8975 (−4.18%)	0.8012 (−8.81%)	0.7015 (−9.08%)	0.6298 (−8.36%)
TALLRec	0.9050 (−3.32%)	0.8290 (−5.21%)	0.7180 (−6.63%)	0.6425 (−6.23%)
STGCN	0.9080 (−2.96%)	0.8340 (−4.44%)	0.7243 (−5.64%)	0.6452 (−5.63%)
GeoSAN	0.9025 (−3.55%)	0.8305 (−4.96%)	0.7150 (−7.01%)	0.6401 (−6.59%)
STP-Rec	0.9100 (−2.75%)	0.8420 (−3.55%)	0.7295 (−4.95%)	0.6520 (−4.63%)
LightGCN	0.8920 (−4.87%)	0.8380 (−4.06%)	0.6990 (−9.59%)	0.6320 (−7.97%)
KGRec	0.9085 (−2.96%)	0.8450 (−3.26%)	0.7320 (−4.53%)	0.6550 (−4.05%)
TransGNN	0.9105 (−2.69%)	0.8465 (−3.01%)	0.7335 (−4.43%)	0.6560 (−4.04%)
S²DCRec	0.9350 *	0.8720 *	0.7660 *	0.6825 *

* The best results are in boldface.

Table 9. Training time and HR@10 comparison across Yelp and Foursquare NYC datasets.

Model	Avg. Training Time (s/epoch)	HR@10 (Yelp)	HR@10 (Foursquare NYC)	Improvement vs. Strongest Baseline
LightGCN	120	0.460	0.420	–
TransGNN	150	0.475	0.435	–
STP-Rec	160	0.480	0.440	–
S²DCRec	155	0.515	0.470	+7.3% (Yelp), +6.8% (Foursquare NYC)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bu, C.; Liu, Y.; Lu, J.; Huang, M.; Li, M.; Li, J. Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation. Big Data Cogn. Comput. 2025, 9, 322. https://doi.org/10.3390/bdcc9120322

AMA Style

Bu C, Liu Y, Lu J, Huang M, Li M, Li J. Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation. Big Data and Cognitive Computing. 2025; 9(12):322. https://doi.org/10.3390/bdcc9120322

Chicago/Turabian Style

Bu, Chong, Yujie Liu, Jing Lu, Manqi Huang, Maoyi Li, and Jiarui Li. 2025. "Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation" Big Data and Cognitive Computing 9, no. 12: 322. https://doi.org/10.3390/bdcc9120322

APA Style

Bu, C., Liu, Y., Lu, J., Huang, M., Li, M., & Li, J. (2025). Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation. Big Data and Cognitive Computing, 9(12), 322. https://doi.org/10.3390/bdcc9120322

Article Menu

Spatio-Temporal and Semantic Dual-Channel Contrastive Alignment for POI Recommendation

Abstract

1. Introduction

2. Related Work

2.1. Traditional POI Recommendation Methods

2.2. Graph-Based POI Recommendation

2.3. Spatio-Temporal and Contrastive Learning-Based Methods for POI Recommendation

3. Problem Formulation

4. Methodology

4.1. Multi-Graphs Embedding Module Generation

4.1.1. Collaborative Interaction Embedding Module

4.1.2. Spatio-Temporal Embedding Module

4.1.3. Semantic Embedding Module

4.2. Contrastive Alignment on Spatio-Temporal and Semantic Dual-Channel

4.2.1. Contrastive Alignment on Spatio-Temporal Channel

4.2.2. Contrastive Alignment on Semantic Channel

4.3. Prediction and Multi-Task Training

4.4. Encoder Functional Overview

4.5. Graph Statistics and Preprocessing

5. Experiment

5.1. Evaluation Metrics

5.2. Experiment Settings

5.2.1. Dataset

5.2.2. Baselines

5.2.3. Evaluation Metrics and Parameter Settings

5.3. Performance Comparison (RQ1)

5.4. Ablation Studies (RQ2)

5.5. Sensitivity Analysis (RQ3)

5.5.1. Impact of Hyperparameters α , β , K, and K ′

5.5.2. Impact of Channel Weight Parameters γ 1 , γ 2 , and γ 3

5.6. Visualization (RQ4)

5.7. Discussion and Efficiency Analysis

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

5.5.1. Impact of Hyperparameters $α$ , $β$ , K, and $K^{'}$

5.5.2. Impact of Channel Weight Parameters $γ_{1}$ , $γ_{2}$ , and $γ_{3}$