1. Introduction
With the rapid proliferation of ubiquitous mobile sensing and online location-sharing platforms, location-based social networks (LBSNs) have become a fundamental component of modern urban information systems. They play a key role in delivering personalized and context-aware experiences in diverse applications, including smart tourism [1,2], intelligent transportation, and mobile commerce. As a core task in LBSNs, next Point-of-Interest (POI) recommendation aims to predict the next location a user is likely to visit based on their historical trajectories [3,4,5]. Predicting the next POI is a particularly dynamic challenge, as it requires modeling immediate user movement. Accurately capturing such mobility patterns is essential for enhancing the intelligence and responsiveness of graph-based urban computing applications.
Most current approaches to next POI recommendation formulate the task as a sequential prediction problem. Research has evolved from early Markov chain models [6,7] to more sophisticated neural approaches, ranging from recurrent neural networks (RNNs) [8,9,10] to self-attention mechanisms [11,12,13], which have progressively enhanced the modeling of user behavior sequences. Most existing approaches have focused on enriching these sequential models by incorporating critical spatio-temporal information, which has proven vital for accurate trajectory modeling [14,15]. Despite their success, most of these methods focus on the intra-sequence features of individual users while ignoring the high-order correlations among different users [3,8,12]. A core trend in graph learning is capturing high-order neighbor dependencies and modeling complex relational structures [16,17,18]. As a result, an increasing number of studies [4,19,20] have employed graph neural networks (GNNs) and their powerful extensions, such as hypergraph neural networks (HGNNs), to learn more expressive representations of users and POIs, thereby improving recommendation performance. Despite this significant progress, most existing methods still face two key challenges that remain insufficiently addressed.
(1) Entangled representations of user preferences. Many earlier works have failed to fully recognize that user interests naturally vary across multiple perspectives and evolve over time, often producing user representations that are confounded. In next POI recommendation scenarios, user–POI interactions are typically influenced by a mix of implicit contextual signals, including geographic distance, temporal patterns, and behavioral tendencies [5,21]. For example, a user may habitually purchase essentials from a nearby store while ignoring a preferred specialty shop located farther away. Nonetheless, existing graph-based and hypergraph-based approaches [18,22] often conflate multiple behavioral signals. They typically model user–POI interactions at a high level, without explicitly accounting for the nuanced factors that influence user decisions. Such representations obscure the distinct motivations behind user behavior, making it difficult to uncover fine-grained and context-aware user preferences. Therefore, identifying and modeling user intentions from multiple perspectives remains a critical challenge that requires urgent attention.
(2) Challenges in cross-perspective synergy. Many current approaches fall short in capturing the essential synergy across diverse perspectives, which restricts their ability to exploit complementary signals during representation learning. Complementarity refers to the integration of heterogeneous information from multiple perspectives to achieve a more holistic understanding of user behavior and improve recommendation accuracy. For the next POI recommendation task, several studies have explored multi-view or disentangled learning frameworks [23,24,25]. A common approach in these works is to construct representations for individual perspectives separately and then aggregate them in a straightforward manner for prediction. However, such strategies often overlook the relational nuances between perspectives. For instance, when a user frequently visits a location, considering only interaction frequency may ignore the influence of visit timing and geographical context. Furthermore, some existing works [18,26] address cross-perspective cooperation only at the output level, making it difficult to ensure any interactive reinforcement during the representation phase. Therefore, there is an urgent need for more sophisticated mechanisms to model the cooperation between different perspectives and foster cross-perspective interactive reinforcement.
To tackle these challenges, we propose a multi-relational hypergraph learning (MRHL) framework, which consists of two main components.
(1) Multi-view representation disentanglement. To address the first limitation, we propose an approach that disentangles representations by modeling the three crucial views of interaction frequency, time decay, and geographical proximity [20,23,25]. These three views are selected as representative behavioral perspectives because they jointly capture the intensity, temporal dynamics, and spatial constraints of user mobility. Moreover, the proposed framework is not restricted to these views and can be naturally extended to incorporate additional behavioral or contextual factors. Importantly, these views are not implemented as simple reweighted variants of a shared graph structure. Instead, each view is modeled as an independent hypergraph that encodes a distinct type of relational dependency, enabling the model to explicitly capture heterogeneous semantics across different behavioral dimensions. Specifically, unlike simple graphs that model relationships in a pairwise manner, we construct a distinct hypergraph for each view to capture high-order correlations among users and POIs. This design enables the model to represent group-wise behavioral patterns and joint effects that naturally arise in user mobility data, and it goes beyond conventional multi-view graph modeling by disentangling heterogeneous relational structures rather than implicitly mixing them within a unified adjacency space. Following this disentanglement, we introduce a novel cascaded enhancement mechanism to synthesize the final user representation, improving personalization and interpretability. Unlike conventional fusion strategies based on parallel aggregation or residual connections, our cascaded enhancement mechanism operates across heterogeneous relational views, each encoding distinct behavioral semantics. Specifically, the proposed fusion follows a sequential dependency paradigm, where representations from preceding views are incorporated as complementary context to enrich subsequent ones. This process can be interpreted as cross-view semantic augmentation rather than simple additive combination, enabling the model to capture complementary and interdependent patterns across behavioral perspectives.
(2) Cross-view synergistic enhancement. To address the second limitation, we further devise a cross-view contrastive learning strategy that promotes semantic consistency across different views through a self-supervised objective. Instead of simply aggregating multi-view representations, this strategy encourages the model to capture complementary and mutually reinforcing patterns across views. As a result, the learned representations can better preserve consistent user preference signals shared among different behavioral perspectives.
To systematically address the aforementioned challenges and guide our methodology, this paper aims to answer the following core research questions (RQs):
RQ1: How can we effectively disentangle the multi-perspective driving factors (e.g., interaction frequency, time decay, and geographical proximity) behind user check-in behaviors to capture fine-grained user intents?
RQ2: How can we design an effective mechanism to foster cross-view interactive reinforcement and synergy, thereby mitigating the data sparsity issue in POI recommendation?
RQ3: Does the proposed MRHL framework outperform existing state-of-the-art models, and what is the specific contribution of each disentangled view to the overall performance?
The primary contributions can be outlined as follows:
We propose three structurally diverse hypergraph representations constructed from the views of visit frequency, time decay, and geographical proximity. Furthermore, we enhance the aggregation and propagation strategies in the hypergraph convolutional network to mitigate the issues of entanglement and information sparsity in user representation learning.
We employ a cross-view contrastive learning strategy that leverages auxiliary tasks to enhance the supervision across different views, thereby effectively mitigating the difficulty of capturing complementary recommendation cues during training.
Comprehensive experiments conducted on three real-world datasets demonstrate the superior performance of our proposed MRHL model, compared to a range of cutting-edge approaches for next POI recommendation.
3. Preliminaries
3.1. Formalization of the Task
Let $\mathcal{U} = \{u_1, u_2, \ldots, u_U\}$ denote the set of users and $\mathcal{L} = \{l_1, l_2, \ldots, l_L\}$ denote the set of POIs, where U and L denote the total number of users and POIs, respectively. Each POI represents a specific geographic location defined by its longitude and latitude coordinates. Each user $u \in \mathcal{U}$ is associated with a chronological sequence of check-in records denoted as $S_u = \{(l_1, t_1), (l_2, t_2), \ldots, (l_n, t_n)\}$, where $l_i$ refers to the POI visited and $t_i$ denotes the corresponding timestamp of visiting location $l_i$.
To effectively capture user behavior, we generate training samples by segmenting each user's complete check-in sequence into multiple trajectory–target pairs. Specifically, for a user's check-in sequence $S_u$, we generate samples in an auto-regressive manner. For the i-th sample, the trajectory is represented as $S_u^i = \{(l_1, t_1), \ldots, (l_i, t_i)\}$. In the next POI recommendation task, the model aims to produce a ranked list of POIs from the entire candidate set $\mathcal{L}$, ensuring that the ground-truth POI appears at a high position in the ranking.
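The auto-regressive segmentation described above can be sketched as follows; `make_samples` is a hypothetical helper name and the POI labels are illustrative.

```python
def make_samples(checkins):
    """Split one user's chronological check-in sequence into
    trajectory-target pairs in an auto-regressive manner.
    checkins: time-ordered list of (poi, timestamp) tuples."""
    samples = []
    for i in range(1, len(checkins)):
        trajectory = checkins[:i]   # the first i check-ins
        target = checkins[i][0]     # the next POI to predict
        samples.append((trajectory, target))
    return samples

# A sequence of 4 check-ins yields 3 trajectory-target pairs.
pairs = make_samples([("l1", 1), ("l2", 2), ("l3", 3), ("l4", 4)])
```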
3.2. Constructed Hypergraphs
A hypergraph [24,32,34] is an extension of a conventional graph, distinguished by its ability to connect two or more vertices within a single hyperedge. Formally, a hypergraph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of vertices and $\mathcal{E}$ denotes the set of hyperedges. To characterize the topology of a weighted hypergraph, a weighted incidence matrix $\mathbf{H} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{E}|}$ is employed, in which $\mathbf{H}(v, e)$ quantifies the connection strength between vertex $v$ and hyperedge $e$. If vertex $v$ is contained in hyperedge $e$, $\mathbf{H}(v, e)$ represents the corresponding connection weight; otherwise, $\mathbf{H}(v, e) = 0$. The weight $\mathbf{H}(v, e)$ can be determined using various interaction metrics, such as interaction frequency, time decay, or geographical proximity.
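As a toy illustration of the weighted incidence matrix described above (sizes and weights are arbitrary, not taken from the paper):

```python
import numpy as np

# Toy weighted incidence matrix H for a hypergraph with 4 vertices and
# 2 hyperedges: H[v, e] > 0 iff vertex v belongs to hyperedge e, and the
# value is the connection weight (e.g., a frequency or decay score).
H = np.zeros((4, 2))
H[0, 0] = 2.0   # vertex 0 is in hyperedge 0 with weight 2
H[1, 0] = 1.0   # vertex 1 is in both hyperedges
H[1, 1] = 0.5
H[3, 1] = 3.0   # vertex 2 belongs to no hyperedge: its row stays zero

vertex_degree = H.sum(axis=1)   # weighted degree of each vertex
edge_degree = H.sum(axis=0)     # weighted degree of each hyperedge
```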
4. Methodology
In this section, we provide a comprehensive description of the proposed MRHL framework. Next POI prediction is inherently influenced by multiple heterogeneous factors, and the dominance of these factors may vary across users and time. Motivated by this observation, MRHL adopts a multi-view modeling strategy that explicitly factorizes user check-in behaviors into several complementary behavioral perspectives.
As depicted in Figure 1, the complete pipeline of our methodology consists of three primary steps:
Step 1: Multi-view Hypergraph Construction. We begin by constructing multiple factorized hypergraph representations derived from users’ check-in data, informed by three distinct metrics, i.e., interaction frequency, time decay, and geographical proximity. Instead of directly merging all relational information into a single hypergraph, we construct separate hypergraphs for each relation to avoid information interference and better preserve their heterogeneous characteristics.
Step 2: Disentangled Representation Learning. We then perform disentangled representation learning using a hypergraph convolutional network equipped with an enhanced aggregation–propagation mechanism to achieve feature decomposition.
Step 3: Cascaded Fusion and Cross-view Contrastive Learning. Next, based on the hypergraph structures, we integrate the learned representations through a cascaded enhancement strategy to capture multi-view user preferences. Furthermore, cross-view contrastive learning is employed to strengthen the supervisory signals and promote consistency across different views.
Finally, we present the prediction and optimization details. All notations used in this paper are summarized in Table 1.
Figure 1. Illustration of the proposed framework MRHL. (a) Hypergraph learning for interaction frequency, time decay, and geographical proximity. (b) Cascaded enhancement fusion & cross-view contrastive learning.
4.1. Multi-Relational Hypergraph Construction
In the context of next POI recommendation, the interactions between users and POIs exhibit diverse and intricate patterns, encompassing user–POI interaction frequency, time decay in sequential transitions, and the geographical proximity among POIs. Existing studies [16,18] typically employ conventional graph structures, where users and POIs are modeled as nodes and their pairwise connections are represented by edges. However, such graph formulations are inherently restricted to binary relations and fail to capture higher-order neighborhood dependencies under specific semantic contexts. Inspired by the structural flexibility of hypergraphs, we propose three heterogeneous hypergraph designs to comprehensively encode these multi-perspective relationships.
4.1.1. Interaction Frequency Hypergraph
The interaction frequency hypergraph is designed to capture high-order dependencies between users and POIs based on historical visiting frequencies. Specifically, we define the interaction frequency hypergraph as $\mathcal{G}_f = (\mathcal{V}, \mathcal{E}_f)$, where $\mathcal{V}$ denotes the set of POIs. Each user is associated with a hyperedge that summarizes the POIs visited in their historical sequence $S_u$, resulting in a hyperedge set $\mathcal{E}_f$ that covers all users. To quantify the interaction intensity between users and POIs, we construct a frequency-based incidence matrix $\mathbf{H}_f \in \mathbb{R}^{L \times U}$, where each column corresponds to a user (i.e., a hyperedge) and each row corresponds to a POI. Each entry $\mathbf{H}_f(l, u)$ records the total number of times user $u$ has visited POI $l$ across the entire visiting history; entries corresponding to POIs not visited by user $u$ are set to 0. It is worth noting that although an entire user sequence is conceptually treated as a hyperedge, the hypergraph is implemented as a fixed POI–user incidence matrix. Therefore, the size of each hyperedge is bounded by the number of POIs rather than the length of the user sequence, ensuring stable and efficient computation. This formulation enables the hypergraph to effectively encode both intra-sequence and cross-sequence relationships within user trajectories. By leveraging this enriched representation, the model can more accurately identify users with analogous visiting behaviors and better characterize their varying degrees of preference for different POIs, enhancing its ability to model user intent.
4.1.2. Time Decay Hypergraph
In conventional hypergraph structures, hyperedges are inherently undirected, which limits their ability to capture directional dependencies such as POI–POI sequential transitions. To address this limitation, we introduce a directed hypergraph to model these sequential relationships more effectively. Specifically, we design a time decay hypergraph $\mathcal{G}_t = (\mathcal{V}, \mathcal{E}_t)$, where the vertex set $\mathcal{V}$ represents all POIs and the hyperedge set $\mathcal{E}_t$ is constructed by aggregating temporal dependency contexts across all user trajectories. For each user $u$, every POI in the sequence $S_u$ is connected to all subsequent POIs, where the edge weight between two POIs is determined by a time decay function of $\Delta t_{i,j}$, the time interval in hours between visiting $l_i$ and $l_j$, with $i < j$. If multiple transitions occur between the same POI pair, their corresponding weights are summed. Consequently, the incidence matrix $\mathbf{H}_t$ encodes the membership strength of POIs to temporal context hyperedges, where rows correspond to source nodes and columns correspond to target nodes. Although the incidence matrix mathematically resides in an $\mathbb{R}^{L \times L}$ space, user mobility is inherently sparse; in practice, users only transition between a highly limited subset of locations. To ensure scalability and avoid the $O(L^2)$ memory explosion associated with dense structures, $\mathbf{H}_t$ is strictly implemented and stored using a sparse matrix format. By allocating memory only for actual non-zero transitions, the space complexity is drastically reduced to $O(N_{\mathrm{nz}})$, where $N_{\mathrm{nz}}$ is the number of valid non-zero interactions. This time decay hypergraph captures global transition patterns aggregated across all user trajectories while emphasizing temporally correlated dependencies. Unlike a conventional directed graph that models only pairwise transitions, this hypergraph formulation enables high-order message aggregation among multiple temporally related POIs simultaneously. Such a design allows the model to capture collective temporal context patterns beyond adjacent transitions, which cannot be fully represented by edge-based message-passing mechanisms.
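The weight accumulation for one trajectory can be sketched as below. The exact decay function is not reproduced here; an exponential form $e^{-\Delta t}$ is assumed purely for illustration, and `decay_weights` is a hypothetical helper.

```python
import math
from collections import defaultdict

def decay_weights(checkins, decay=lambda dt: math.exp(-dt)):
    """Accumulate decayed weights for every (earlier -> later) POI pair
    within one trajectory; checkins = [(poi, time_in_hours), ...].
    Weights of repeated transitions between the same pair are summed,
    and the result is kept sparse as a dict keyed by (source, target)."""
    W = defaultdict(float)
    for i in range(len(checkins)):
        for j in range(i + 1, len(checkins)):
            (li, ti), (lj, tj) = checkins[i], checkins[j]
            W[(li, lj)] += decay(tj - ti)
    return dict(W)

W = decay_weights([("a", 0.0), ("b", 1.0), ("a", 2.0)])
```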
4.1.3. Geographical Proximity Hypergraph
This hypergraph is designed to model the spatial correlations among POIs under distance constraints. Specifically, we construct a hypergraph $\mathcal{G}_g = (\mathcal{V}, \mathcal{E}_g)$ based on the distances between POIs, where $\mathcal{V}$ denotes the set of POIs. In $\mathcal{G}_g$, hyperedges are formed by connecting POIs whose pairwise Haversine distance does not exceed a predefined threshold $\delta$. The incidence matrix $\mathbf{H}_g$ quantitatively represents the spatial relationships between POIs, with each entry $\mathbf{H}_g(l_i, l_j)$ computed from the Haversine distance $d(l_i, l_j)$ between $l_i$ and $l_j$, subject to $d(l_i, l_j) \le \delta$. By incorporating this continuous distance-based weighting scheme, the hypergraph is able to capture varying strengths of geographical proximity among POIs, thereby modeling the spatial correlations in a more nuanced and realistic manner.
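The Haversine distance and a thresholded weighting can be sketched as follows. The linear weighting in `proximity_weight` is an assumption for illustration; the paper's continuous weighting scheme may differ.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (Haversine) distance between two points in km."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2.0 * r * math.asin(math.sqrt(a))

def proximity_weight(d_km, threshold_km):
    # Hypothetical continuous weighting: decays linearly with distance
    # and drops to 0 beyond the threshold.
    return max(0.0, 1.0 - d_km / threshold_km)
```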
4.2. Multi-Relational Hypergraph Convolutional Networks
To effectively capture multi-view and multi-relational POI representations from the three heterogeneous hypergraphs, we design customized improvements to the aggregation and propagation mechanisms of the hypergraph convolutional network. Prior to the encoding phase, the POI embedding matrix $\mathbf{E} \in \mathbb{R}^{L \times d}$ is first initialized, where $d$ denotes the embedding dimensionality. In the following, we elaborate on the three proposed hypergraph neural network models.
4.2.1. Interaction Frequency Hypergraph Convolutional Network
After constructing the interaction frequency hypergraph, we are able to better characterize the higher-order dependencies among nodes. To this end, we propose an interaction frequency-based hypergraph convolution approach to learn high-level node representations. Specifically, we first derive hyperedge embeddings by aggregating the features of neighboring nodes connected to each hyperedge and then employ these hyperedge embeddings to refine the node representations with higher-order information. Formally, the hyperedge embedding matrix $\mathbf{M}_f$ is obtained as follows:

$$\mathbf{M}_f = \mathbf{D}_e^{-1} \mathbf{H}_f^{\top} \mathbf{E}, \quad (5)$$

where $\mathbf{D}_e$ denotes the hyperedge degree matrix, $\mathbf{H}_f$ is the vertex–hyperedge incidence matrix of the frequency hypergraph, and $\mathbf{E}$ represents the initialized node embedding matrix. Subsequently, the updated node embedding matrix $\mathbf{E}'_f$ can be computed as follows:

$$\mathbf{E}'_f = \mathbf{D}_v^{-1} \mathbf{H}_f \mathbf{M}_f, \quad (6)$$

where $\mathbf{D}_v$ represents the node degree matrix. Specifically, Equation (5) aggregates features of all nodes connected to each hyperedge, capturing high-frequency co-occurrence patterns of POIs. Equation (6) then updates each node by integrating its incident hyperedges, allowing nodes to incorporate information from other frequently co-selected POIs. Based on these two steps, our hypergraph convolution process can effectively encode frequency-enhanced information underlying users' POI selection behavior.
To further enhance the capacity to model complex dependencies among nodes, we employ a multi-layer hypergraph convolution scheme, in which the information propagation from the $(\ell-1)$-th layer to the $\ell$-th layer for the node embedding matrix is formulated as

$$\mathbf{E}^{(\ell)} = \mathbf{D}_v^{-1} \mathbf{H}_f \mathbf{D}_e^{-1} \mathbf{H}_f^{\top} \mathbf{E}^{(\ell-1)} + \mathbf{E}^{(\ell-1)}, \quad (7)$$

where $\mathbf{E}^{(\ell)}$ denotes the node embedding matrix updated at the $\ell$-th layer of the hypergraph. The skip connections are strategically employed to enrich the semantic information of nodes and mitigate the over-smoothing phenomenon in hypergraph neural networks. Finally, the embeddings produced by all layers are averaged to compute the final node embeddings $\mathbf{E}_f$, which can be formulated as

$$\mathbf{E}_f = \frac{1}{K+1} \sum_{\ell=0}^{K} \mathbf{E}^{(\ell)}, \quad (8)$$

where $K$ denotes the number of layers in the hypergraph convolution network and $\mathbf{E}^{(0)}$ represents the initialized POI embedding matrix $\mathbf{E}$. In this manner, the updated node representations can effectively capture high-order relational features. Moreover, mean pooling is applied during the hypergraph convolution process to enhance both the efficiency and performance of hypergraphs across different views.
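A minimal NumPy sketch of the two-step convolution and the layer-averaging scheme, using a degree-normalized form consistent with the degree matrices described above (the paper's exact normalization may differ):

```python
import numpy as np

def hypergraph_conv(H, E):
    """One node -> hyperedge -> node round: hyperedge embeddings are
    degree-normalized sums of member-node features, then nodes are
    refreshed from their incident hyperedges."""
    De_inv = np.diag(1.0 / H.sum(axis=0))  # inverse hyperedge degrees
    Dv_inv = np.diag(1.0 / H.sum(axis=1))  # inverse node degrees
    M = De_inv @ H.T @ E                   # hyperedge embeddings
    return Dv_inv @ H @ M                  # updated node embeddings

def multilayer_conv(H, E0, K):
    """K stacked layers with skip connections, averaged at the end."""
    layers = [E0]
    for _ in range(K):
        layers.append(hypergraph_conv(H, layers[-1]) + layers[-1])
    return np.mean(layers, axis=0)

H = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # 3 POIs, 2 hyperedges
E_final = multilayer_conv(H, np.eye(3), K=2)
```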
4.2.2. Time Decay Hypergraph Convolutional Network
Since frequency hypergraphs are limited in capturing directed relationships during convolution operations, we propose a time decay hypergraph convolutional network to model temporal transition strengths between POIs. This approach emphasizes recent interactions while preserving the global sequential structure. The network also adopts a two-step process in which hyperedges aggregate directionally associated nodes and then update node representations through hyperedges. Similar to the aggregation mechanism of the frequency hypergraph, the time decay hyperedge embedding matrix $\mathbf{M}_t$ and the updated node embedding matrix $\mathbf{E}'_t$ can be expressed as follows:

$$\mathbf{M}_t = \mathbf{D}_{e,t}^{-1} \mathbf{H}_t^{\top} \mathbf{E}, \qquad \mathbf{E}'_t = \mathbf{D}_{v,t}^{-1} \mathbf{H}_t \mathbf{M}_t, \quad (9)$$

where $\mathbf{D}_{e,t}$ and $\mathbf{D}_{v,t}$ denote the hyperedge degree matrix and the node degree matrix, respectively. From a computational perspective, since $\mathbf{H}_t$ is highly sparse, the hypergraph convolution operations in Equation (9) are executed via Sparse Matrix–Matrix Multiplication (SpMM). Consequently, the propagation process scales linearly with the number of non-zero interactions, with a time complexity of $O(N_{\mathrm{nz}} d)$. This is significantly more efficient than the $O(L^2 d)$ complexity of dense matrix multiplication and ensures that the runtime remains highly efficient even for datasets like Gowalla with tens of thousands of POIs. The proposed time decay mechanism balances global transitions and short-term preferences, mitigating outdated effects and enhancing adaptability to dynamic user behaviors.
By applying sequential convolution, the time decay hypergraph models the global POI–POI transition patterns. Following the K-layer propagation scheme of the frequency hypergraph, we derive the POI representations $\mathbf{E}_t$ for the time decay view.
4.2.3. Geographical Proximity Hypergraph Convolutional Network
In the proximity view $\mathcal{G}_g$, the spatial correlation between POIs gradually weakens as the distance increases. To capture this spatial decay effect, we introduce a geographical proximity hypergraph convolutional network. As illustrated in Figure 1, we still employ a node–hyperedge–node message-passing scheme to generate POI representations. Similar to the aggregation mechanisms of the aforementioned hypergraphs, the geographical proximity hyperedge embedding matrix $\mathbf{M}_g$ and the corresponding node embedding matrix $\mathbf{E}'_g$ can be expressed as follows:

$$\mathbf{M}_g = \mathbf{D}_{e,g}^{-1} \mathbf{H}_g^{\top} \mathbf{E}, \qquad \mathbf{E}'_g = \mathbf{D}_{v,g}^{-1} \mathbf{H}_g \mathbf{M}_g, \quad (10)$$

where $\mathbf{D}_{e,g}$ denotes the hyperedge degree matrix of the proximity hypergraph and $\mathbf{D}_{v,g}$ represents its node degree matrix. This design not only enhances the model's capability to capture spatial correlations but also effectively suppresses noise from distant POIs, thereby improving both interpretability and generalization performance.
Similar to the previous hypergraph convolutional process, we still adopt a K-layer network to enrich the semantic representations of nodes and incorporate a skip connection mechanism to alleviate the over-smoothing issue. Consequently, we obtain the node embedding matrix of the proximity hypergraph, denoted as $\mathbf{E}_g$.
By introducing three differentiated aggregation and propagation mechanisms, we can capture diverse POI representations from the views of interaction frequency, time decay, and geographical proximity.
4.3. Cascaded Enhancement Fusion for User Preferences
Based on the above procedure, we obtain the POI embedding matrices $\mathbf{E}_f$, $\mathbf{E}_t$, and $\mathbf{E}_g$ from the three distinct hypergraph views. Given a segmented user trajectory $S_u^i$, the user preference representations can be derived under different hypergraph views by looking up the corresponding POI embeddings in the matrices and summing them accordingly. Formally, the user preferences are expressed as

$$\mathbf{p}_f = \sum_{l \in S_u^i} \mathbf{E}_f(l), \qquad \mathbf{p}_t = \sum_{l \in S_u^i} \mathbf{E}_t(l), \qquad \mathbf{p}_g = \sum_{l \in S_u^i} \mathbf{E}_g(l), \quad (11)$$

where $\mathbf{p}_f$ indicates the user preference under the interaction frequency view, $\mathbf{p}_t$ represents the preference under the time decay view, and $\mathbf{p}_g$ corresponds to the preference under the geographical proximity view. These user representations learned from different views collaboratively work together to drive user behavior.
This subsection explores how to effectively fuse user representations from different views to determine the final preference. Conventional fusion strategies, such as linear fusion and adaptive fusion, either neglect the interactions among different views or fail to capture fine-grained cross-view relationships. To address these limitations, we propose a novel cascaded enhancement fusion strategy. Unlike conventional residual connections that perform homogeneous feature aggregation within a shared representation space, the proposed fusion operates across heterogeneous relational views, each encoding distinct behavioral semantics. This method performs sequential cross-view propagation, allowing the user preference information from the preceding view to progressively integrate into the preferences of subsequent views. Although the formulation appears as an additive operation, it should be interpreted as a cross-view semantic enrichment process rather than a simple combination. Specifically, the representation from a preceding view serves as complementary contextual information, which is injected into the subsequent view to enhance its semantic expressiveness. Formally, the process can be expressed as

$$\hat{\mathbf{p}}_t = \mathbf{p}_t + \mathbf{p}_f, \qquad \hat{\mathbf{p}}_g = \mathbf{p}_g + \hat{\mathbf{p}}_t, \quad (12)$$

where $\hat{\mathbf{p}}_t$ denotes the enhanced user preference under the time decay view and $\hat{\mathbf{p}}_g$ denotes the enhanced preference under the geographical proximity view. Through this sequential refinement process, information from earlier views is progressively accumulated and propagated, enabling later views to capture richer contextual dependencies and cross-view complementary patterns. In this way, the mechanism not only generates more discriminative and robust user representations but also strengthens the correlation and consistency among different view-specific representations.
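The cascaded enhancement can be sketched with toy vectors, assuming the additive reading of the fusion described above (the vectors stand in for the frequency, time decay, and proximity preferences):

```python
import numpy as np

# Toy stand-ins for the three view-specific user preferences.
p_f = np.array([1.0, 0.0, 0.0])   # frequency view
p_t = np.array([0.0, 1.0, 0.0])   # time decay view
p_g = np.array([0.0, 0.0, 1.0])   # proximity view

# Cascaded enhancement: each later view is enriched with the
# representation of the preceding view.
p_t_hat = p_t + p_f       # decay view enhanced with frequency context
p_g_hat = p_g + p_t_hat   # proximity view enhanced with the enriched decay view
```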
4.4. Multi-Relational Contrastive Learning
After obtaining the POI embeddings and user preference vectors, we design a multi-relational contrastive learning framework to enforce consistency and correlation across different views. This framework encourages the representations of the same user or POI from different views to be closer in the embedding space, thereby improving the effectiveness of multi-view information fusion. Specifically, the representations of the same user or POI across views, such as $\mathbf{p}_f^u$ and $\mathbf{p}_t^u$, are regarded as positive pairs, while those of different users or POIs are regarded as negative pairs. Formally, the contrastive loss function for user representations between the interaction frequency and time decay views can be defined as follows:

$$\mathcal{L}_{ft}^{user} = -\frac{1}{N} \sum_{u \in B} \log \frac{\exp\big(\mathrm{sim}(\mathbf{p}_f^u, \mathbf{p}_t^u)/\tau\big)}{\sum_{u' \in B} \exp\big(\mathrm{sim}(\mathbf{p}_f^u, \mathbf{p}_t^{u'})/\tau\big)}, \quad (13)$$

where $N$ denotes the total number of samples, $B$ represents the mini-batch, and $\tau$ is the temperature hyperparameter.
This formulation follows a standard contrastive learning objective. The numerator corresponds to the positive pair (i.e., representations of the same user–POI interaction across different views), while the denominator aggregates the similarities from both positive and negative pairs within the mini-batch. By optimizing the contrastive objective, the model increases the similarity of the positive pair while reducing the relative similarity of negative pairs (where $u' \neq u$), thereby effectively pushing dissimilar representations apart in the embedding space.
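A compact NumPy rendering of this cross-view objective, assuming cosine similarity for sim(·,·); `cross_view_infonce` is an illustrative helper name:

```python
import numpy as np

def cross_view_infonce(Z1, Z2, tau=0.2):
    """Row i of Z1 and Z2 hold the same user's representation under two
    views (positive pair); every other row in the batch is a negative."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    logits = Z1 @ Z2.T / tau                     # cosine similarities / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # -log softmax of positives

# Aligned views give a lower loss than views with mismatched rows.
aligned = cross_view_infonce(np.eye(2), np.eye(2))
misaligned = cross_view_infonce(np.eye(2), np.eye(2)[::-1])
```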
Based on the above formulation, the contrastive loss between the frequency view and the proximity view for users can be expressed as $\mathcal{L}_{fg}^{user}$, while that between the decay view and the proximity view is denoted as $\mathcal{L}_{tg}^{user}$. We then aggregate the contrastive losses across all pairs of views, yielding the overall contrastive loss for modeling user preferences as

$$\mathcal{L}_{cl}^{user} = \mathcal{L}_{ft}^{user} + \mathcal{L}_{fg}^{user} + \mathcal{L}_{tg}^{user}. \quad (14)$$

The contrastive learning objective is symmetrically applied to POI representations across different relational views. Specifically, for each POI, its embeddings obtained from different views (i.e., frequency, time decay, and proximity) are treated as positive pairs, while embeddings of other POIs are treated as negative pairs. Following the same formulation as in Equations (13) and (14), the contrastive losses between different view pairs are computed and aggregated, yielding the final contrastive loss for POI embeddings, denoted as $\mathcal{L}_{cl}^{poi}$. By summing the contrastive losses of users and POIs, we derive the overall final contrastive loss as

$$\mathcal{L}_{cl} = \mathcal{L}_{cl}^{user} + \mathcal{L}_{cl}^{poi}. \quad (15)$$
4.5. Prediction and Optimization
As the geographical proximity pattern is sparser and less expressive compared with the decay and frequency patterns, we enhance the user geographical proximity representation $\hat{\mathbf{p}}_g$. To incorporate user preferences from multiple views into the geographical proximity view, we employ a concatenation operation denoted by ⊕. In particular, the geographical preference $\hat{\mathbf{p}}_g$ is first concatenated with the frequency preference $\mathbf{p}_f$ and the time decay preference $\hat{\mathbf{p}}_t$. The combined vector is then passed through a multi-layer perceptron (MLP) to obtain the refined geographical representation $\tilde{\mathbf{p}}_g = \mathrm{MLP}(\hat{\mathbf{p}}_g \oplus \mathbf{p}_f \oplus \hat{\mathbf{p}}_t)$.
By leveraging the user preference representations $\mathbf{p}_f$, $\hat{\mathbf{p}}_t$, and $\tilde{\mathbf{p}}_g$ obtained from different views together with the corresponding POI embeddings $\mathbf{E}_f$, $\mathbf{E}_t$, and $\mathbf{E}_g$, the interaction score between user $u$ in a specific trajectory $S_u^i$ and the POIs is defined as

$$\hat{\mathbf{y}} = \mathrm{softmax}\big(\mathbf{E}_f \mathbf{p}_f + \mathbf{E}_t \hat{\mathbf{p}}_t + \mathbf{E}_g \tilde{\mathbf{p}}_g\big), \quad (16)$$

where $\hat{\mathbf{y}} \in \mathbb{R}^{L}$; each value $\hat{y}_l$ represents the preference score of user $u$ for POI $l$.
To optimize the alignment between the predicted distribution and the actual user behavior, we adopt the cross-entropy loss function, which is formally defined as

$$\mathcal{L}_{ce} = -\sum_{l=1}^{L} y_l \log \hat{y}_l + \lambda \|\Theta\|_2^2, \quad (17)$$

where $y_l$ denotes the ground-truth label, with $k$ indicating the index of the POI that is actually visited (i.e., $y_k = 1$ and all other entries are 0). $\|\Theta\|_2^2$ represents the $L_2$ regularization over all model parameters to mitigate overfitting, and $\lambda$ is the weight coefficient for the $L_2$ regularization term.
The overall loss of the model can be formulated as a combination of the contrastive learning loss and the cross-entropy loss, expressed as follows:

$$\mathcal{L} = \mathcal{L}_{ce} + \beta \mathcal{L}_{cl}, \quad (18)$$

where the coefficient $\beta$ is introduced to balance the two types of losses.
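The scoring and loss computation can be sketched as below. The summed per-view dot products followed by a softmax are an assumed reading of the scoring step; `predict_distribution` and `cross_entropy` are illustrative helpers.

```python
import numpy as np

def predict_distribution(E_views, p_views):
    """Sum the user-POI dot products over the views, then softmax.
    E_views: list of (L, d) POI embedding matrices; p_views: matching
    list of (d,) user preference vectors."""
    scores = sum(E @ p for E, p in zip(E_views, p_views))
    scores = scores - scores.max()   # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the ground-truth POI."""
    return -np.log(probs[target_idx])

rng = np.random.default_rng(0)
E_views = [rng.normal(size=(5, 4)) for _ in range(3)]  # 5 POIs, dim 4
p_views = [rng.normal(size=4) for _ in range(3)]
y_hat = predict_distribution(E_views, p_views)
loss = cross_entropy(y_hat, target_idx=2)
```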
5. Experiments
In this section, we provide a systematic account of the experimental design and comprehensively assess the performance of MRHL from multiple perspectives. Specifically, our evaluation focuses on overall recommendation effectiveness, the contribution of different components, the impact of the cascaded fusion strategy, hyperparameter sensitivity, and performance in highly sparse scenarios.
5.1. Experimental Settings
5.1.1. Data Sets
We conduct experiments on three widely used real-world check-in datasets, namely, New York (NYC), Tokyo (TKY), and Gowalla. These datasets exhibit different levels of sparsity and distinct patterns of user mobility trajectories, as detailed below.
NYC and TKY (https://sites.google.com/site/yangdingqi/home/foursquare-dataset, accessed on 1 April 2026): These two datasets are derived from the Foursquare platform, comprising user check-in records collected in the metropolitan areas of Tokyo and New York City. The collection period spans from 12 April 2012 to 16 February 2013.
Gowalla (https://snap.stanford.edu/data/loc-gowalla.html, accessed on 1 April 2026): This dataset captures user check-in activities on the Gowalla platform worldwide, covering the period from February 2009 to October 2010, and contains rich spatial and temporal information.
Following prior work to ensure data quality [
12,
16], users and POIs with fewer than 5 interactions are filtered out from the NYC and TKY datasets. For the Gowalla dataset, users with fewer than 20 interactions and POIs visited fewer than 30 times are removed, and only users who interacted with at least 10 locations are retained.
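The filtering step above must be applied iteratively, since removing a sparse POI can push a user below the threshold and vice versa. A minimal sketch with the NYC/TKY thresholds (the function and variable names are illustrative):

```python
from collections import Counter

def filter_checkins(checkins, min_user=5, min_poi=5):
    """checkins: list of (user, poi) pairs. Repeatedly drop users and
    POIs below the interaction thresholds until the dataset is stable."""
    while True:
        users = Counter(u for u, _ in checkins)
        pois = Counter(p for _, p in checkins)
        kept = [(u, p) for u, p in checkins
                if users[u] >= min_user and pois[p] >= min_poi]
        if len(kept) == len(checkins):   # fixed point reached
            return kept
        checkins = kept

# toy data: user "a" has 6 check-ins, user "b" only 2 and is filtered out
data = [("a", "x")] * 6 + [("b", "x")] * 2
kept = filter_checkins(data)
```

The Gowalla setting would use the stricter cut-offs (20 interactions per user, 30 visits per POI) with the same loop.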
Table 2 summarizes the statistics of the three datasets. Each user’s check-in records are arranged in chronological order, with the first 80% used for training, the middle 10% for validation, and the remaining 10% for testing. This strict chronological partition is widely adopted as the standard evaluation protocol in sequence-based recommendation tasks. Unlike random cross-validation, it effectively prevents temporal data leakage, ensuring that future check-ins are not inappropriately utilized to predict past mobility behaviors.
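The chronological 80/10/10 split described above can be sketched as a small helper (illustrative, not the authors' code); because the sequence is already time-ordered, every training check-in precedes every validation and test check-in:

```python
def chrono_split(checkins, train=0.8, val=0.1):
    """Split one user's chronologically ordered check-ins into
    train/val/test without leaking future visits into training."""
    n = len(checkins)
    a, b = int(n * train), int(n * (train + val))
    return checkins[:a], checkins[a:b], checkins[b:]

seq = list(range(20))            # 20 check-ins, already time-ordered
tr, va, te = chrono_split(seq)
```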
It is worth noting that while these datasets were collected in earlier years, they remain the most widely adopted and standardized benchmarks in the POI recommendation community. Due to increasingly strict global privacy regulations in recent years, the public release of contemporary, large-scale, and fine-grained human mobility trajectories has been heavily restricted. Therefore, utilizing these established benchmark datasets is essential to ensure fair, rigorous, and reproducible comparisons with existing state-of-the-art baseline models.
5.1.2. Baselines
To evaluate the effectiveness of MRHL, we compare it with seven baseline methods, where two belong to general recommendation models (SAE-NAD and TEMN) and the remaining five correspond to sequential recommendation models (STGCN, PLSPL, LSTPM, PG2Net, and Diff-POI).
SAE-NAD [
12] employs a multi-dimensional attention mechanism to adaptively model the varying importance of user preferences across different dimensions.
TEMN [
3] proposes a deep architecture that nonlinearly integrates topic modeling with memory networks, thereby leveraging both the global structure of latent patterns and the advantages of local neighborhood-based features.
STGCN [
18] constructs a multi-graph representation of user records to integrate all contextual information and designs scoring functions to capture users’ periodic patterns for recommendation.
PLSPL [
8] jointly models users’ long- and short-term preferences via attention and parallel LSTMs, which are linearly combined to characterize the user preference.
LSTPM [
28] employs a nonlocal network to model long-term preferences and utilizes a geo-dilated LSTM to capture non-consecutive geographical correlations.
PG2Net [
46] captures users’ personalized preferences and group-level spatio-temporal preferences through Bi-LSTM-based sequential modeling and auxiliary representation learning.
Diff-POI [
16] employs two graph modules to model users’ visit sequences and spatial features and incorporates a diffusion-based sampling strategy to capture visit trends.
5.1.3. Evaluation Metrics
To evaluate the performance of each model, we adopt three widely used metrics in sequential recommendation, including Recall@K, Normalized Discounted Cumulative Gain (NDCG@K), and Mean Reciprocal Rank (MRR).
Recall@K measures the proportion of ground-truth POIs successfully captured within the top-K recommended results, and is defined as follows:
$$\mathrm{Recall@}K = \frac{1}{R} \sum_{r=1}^{R} \mathbb{1}\!\left(p_r \in S_K^r\right),$$
where $R$ denotes the total number of samples, $S_K^r$ represents the set of top-$K$ recommended POIs, and $p_r$ indicates the ground-truth POI visited by the user. In the next POI recommendation task, each sample has exactly one ground-truth POI, so Recall@K is equivalent to the hit ratio.
NDCG@K evaluates the ranking quality by assigning logarithmically decayed weights to positions within the top-$K$ recommendations, and its formulation for the next POI recommendation task is given as follows:
$$\mathrm{NDCG@}K = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{\log_2(\mathrm{rank}_r + 1)},$$
where $\mathrm{rank}_r$ denotes the position of the ground-truth POI within the top-$K$ recommendation list. If the POI does not appear in the list, $\mathrm{rank}_r$ is set to $\infty$, so the corresponding term contributes zero.
MRR measures the reciprocal rank of the ground-truth position in the recommendation list, serving to quantify the overall ranking performance, and is defined as follows:
$$\mathrm{MRR} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{\mathrm{rank}_r}.$$
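For a single test sample with one ground-truth POI, all three metrics reduce to simple rank lookups; a minimal sketch (standard binary-relevance definitions, helper names ours):

```python
import math

def recall_at_k(ranked, target, k):
    # with one ground-truth POI, Recall@K reduces to a hit indicator
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    # logarithmic position discount; 0 when the target misses the top-K list
    if target in ranked[:k]:
        rank = ranked.index(target) + 1
        return 1.0 / math.log2(rank + 1)
    return 0.0

def mrr(ranked, target):
    # reciprocal rank of the ground-truth POI in the full list
    return 1.0 / (ranked.index(target) + 1) if target in ranked else 0.0

ranked = ["p3", "p1", "p7", "p2"]   # toy top-4 recommendation list
```

Averaging these per-sample values over all $R$ test samples yields the reported dataset-level scores.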
5.1.4. Parameter Settings
All experiments are conducted on an NVIDIA 3080Ti GPU using the PyTorch (version 2.8.0) framework. For the baseline methods, we adopt the parameter settings reported in the original papers and perform hyperparameter tuning on the three datasets. For training the MRHL model, we employ the Adam optimizer with a learning rate of and weight decay of . The embedding dimension for users and POIs is set to , and the batch size is fixed at 1024. For data preprocessing, the order of the auto-regressive model is set to 100. For trajectories longer than this threshold, only the most recent 100 POIs are retained to preserve the latest user preferences. For shorter sequences, zero-padding is applied to ensure uniform input dimensions for efficient batch computation. Following prior empirical settings, the distance threshold is set to 1 km. The number of convolutional layers is tuned from , and the temperature parameter is tuned within . The regularization weight is set to , and the balance coefficient is fixed at 0.1.
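The truncation-and-padding step described above (keeping only the most recent 100 POIs and left-padding shorter sequences with zeros) can be sketched as follows; the function name and padding token are illustrative:

```python
def pad_or_truncate(seq, max_len=100, pad=0):
    """Keep only the most recent max_len POIs to preserve the latest
    preferences; left-pad shorter sequences so batches share one shape."""
    seq = seq[-max_len:]                        # most recent visits win
    return [pad] * (max_len - len(seq)) + seq

short = pad_or_truncate([5, 9, 2], max_len=6)          # padded on the left
long = pad_or_truncate(list(range(1, 200)), max_len=6) # truncated to last 6
```

Left-padding keeps the most recent check-in at a fixed position (the end of the sequence), which is convenient for auto-regressive models.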
5.2. Performance Comparison with Baselines
The overall performance of all baseline methods and the proposed MRHL model is summarized in
Table 3. Although earlier methods such as STAN and GETNext are relevant, they are not included in the empirical comparison, because more recent methods (e.g., PG2Net and Diff-POI) have been shown to achieve superior performance under similar evaluation settings. To verify the robustness of our model, all results are reported as the mean of five independent runs with different random initializations. The variance across runs is consistently negligible, with all standard deviations below 0.002. In addition, a paired t-test shows that the improvements of MRHL over the best-performing baselines are statistically significant. Based on these results, several key findings can be drawn. The proposed MRHL model consistently surpasses all baseline methods across the three benchmark datasets under various evaluation metrics. This superior performance primarily stems from two design aspects. First, MRHL constructs multi-relational hypergraphs guided by interaction frequency, time decay, and geographical proximity information. This design enables the model to capture user preferences from multiple views and enhances semantic expressiveness through an improved hypergraph convolution process. Second, a multi-view contrastive objective is introduced to encourage information exchange across different views. This mechanism enhances the representation capacity of each specific view and allows the model to leverage self-supervised signals to uncover richer and more comprehensive recommendation patterns.
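The paired t-test over the five matched runs can be computed with a few lines of standard-library Python. This is only a sketch of the procedure: the MRR values below are made-up placeholders, not the reported results.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic over matched runs (same seeds for both models);
    a positive t favors model a over baseline b."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

mrhl_runs = [0.512, 0.509, 0.514, 0.511, 0.513]   # placeholder per-seed MRRs
base_runs = [0.458, 0.460, 0.455, 0.459, 0.457]
t = paired_t(mrhl_runs, base_runs)
```

With five runs (4 degrees of freedom), the two-sided critical value at the 0.05 level is about 2.78, so any t above that indicates a significant improvement.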
In the baseline comparison, we first observe that sequence-based recommendation models consistently outperform general recommendation models. This advantage arises from the stronger influence of users’ historical trajectories on their subsequent behavioral patterns, enabling sequence models to more effectively capture latent preferences and deliver superior recommendation performance. Among the sequence-based baselines, PLSPL, LSTPM, and PG2Net are representative LSTM-based variants that explicitly model users’ sequential mobility patterns. These LSTM-based methods demonstrate competitive performance by effectively modeling sequential dependencies in user trajectories. However, their improvements remain limited by relying primarily on single-view sequential information, which restricts their ability to exploit high-order relations and complementary signals from multiple perspectives. At a more advanced modeling level, Diff-POI disentangles the effects of geographical constraints and sequential user interactions, thereby achieving substantial improvements over other baselines across three different evaluation metrics. Building upon this line of research, our proposed MRHL framework, which explicitly separates hypergraphs from three distinct views, further surpasses all baseline methods. Collectively, these results demonstrate the importance and necessity of constructing multi-view representations through explicit disentangled learning for effectively modeling user—POI relationships.
In the context of disentangled multi-view representation learning, MRHL exhibits a clear advantage over the second-best model Diff-POI. Particularly, it improves the MRR by 11.70% on the NYC dataset, highlighting its superior capability in capturing user preferences. This performance advantage can be attributed to three key factors. First, MRHL effectively captures non-consecutive POI information in user trajectories. For instance, the time decay hypergraph goes beyond adjacent transitions and models the complete transition patterns. Second, by leveraging the hypergraph structure, MRHL is more capable of capturing high-order signals than GNN-based Diff-POI. Hyperedges can simultaneously connect multiple nodes, thereby alleviating data sparsity and oversmoothing while enriching semantic representations. Finally, the proposed contrastive learning framework effectively enhances the complementary effects across different views in a self-supervised manner, thereby further strengthening the overall recommendation performance and improving the robustness of user preference modeling.
Most next POI recommendation models achieve relatively high performance on dense urban datasets such as NYC and TKY, but their effectiveness decreases significantly on sparse datasets like Gowalla, where user check-ins are distributed globally. This discrepancy arises because dense datasets provide continuous and localized trajectories that facilitate sequential transition modeling and neighborhood aggregation. In contrast, the sparsity and discontinuity in Gowalla make it difficult to capture such patterns. However, our MRHL model exhibits a more stable performance gain on the sparse Gowalla dataset. By constructing multiple relational hypergraphs from complementary perspectives, MRHL is able to alleviate trajectory sparsity by modeling high-order associations among POIs and users. Moreover, the proposed multi-relational contrastive learning strategy encourages consistency across different views, enabling the model to extract robust signals even when sequential information is limited. As a result, MRHL achieves consistently superior performance on Gowalla compared with representative trajectory modeling and relational learning approaches, demonstrating its effectiveness in sparse and globally distributed scenarios.
In summary, the outstanding performance of MRHL across diverse datasets demonstrates its effectiveness and broad applicability in next POI recommendation. By constructing multi-view hypergraphs and incorporating contrastive learning, the model can more precisely capture the complex structures and relational patterns within user-POI interaction networks. This enables the learning of more accurate and personalized user preferences and node representations, ultimately leading to significant improvements in recommendation performance. It is worth noting that the evaluated datasets exhibit substantially different characteristics in terms of spatial scale, data sparsity, and user mobility patterns. Despite these differences, MRHL consistently achieves superior performance across all datasets, indicating that the proposed framework is not tailored to a specific controlled setting but can effectively adapt to more complex and dynamic recommendation scenarios.
5.3. Ablation Study
5.3.1. Impact of Different Components
To evaluate the impact of each component in MRHL, we conduct ablation studies on three datasets to analyze their respective contributions. The complete MRHL is regarded as the base model, from which four different variants are derived by removing specific components.
w/o P-HG: This variant removes the geographical proximity hypergraph module.
w/o F-HG: This variant removes the interaction frequency hypergraph module.
w/o D-HG: This variant removes the time decay hypergraph module.
w/o CL: This variant removes contrastive learning between different views.
The results of the ablation study are presented in
Table 4, from which the following conclusions can be drawn. Throughout all datasets, the full MRHL model maintains consistently superior performance across all metrics, demonstrating its robustness and effectiveness. On both the high-density NYC and medium-density TKY datasets, removing the geographical proximity hypergraph leads to the largest performance drop, underscoring the dominant role of spatial correlations in improving recommendation accuracy. The exclusion of the interaction frequency hypergraph also results in noticeable degradation, particularly on TKY, indicating the importance of visit frequency for modeling in medium-density scenarios. In contrast, removing the time decay hypergraph or contrastive learning module results in only a minor performance drop. This indicates that although temporal dynamics and contrastive signals provide additional benefits, their importance is less pronounced than that of spatial and frequency information in relatively dense datasets.
The ablation study on the Gowalla dataset reveals that removing the time decay hypergraph leads to the most significant performance degradation. This highlights that in the lowest-density scenario, user behaviors exhibit strong sequential dependencies, and modeling sequential transitions plays a central role in capturing user interests. In contrast, eliminating the geographical proximity and interaction frequency hypergraphs also results in notable performance decreases, indicating that spatial constraints and interaction intensity remain effective auxiliary factors. The exclusion of the contrastive learning module produces only a relatively minor decline, suggesting that its primary contribution lies in enhancing representation robustness.
In summary, the ablation results indicate that the modules of MRHL exhibit complementary effects across different datasets. Spatial correlations play a dominant role in high-density scenarios, while time decay is crucial in sparse environments. Meanwhile, interaction frequency and contrastive learning further enhance the robustness of the model under diverse data conditions.
5.3.2. Impact of Different Fusion Strategies
To further validate the effectiveness of the proposed cascaded fusion strategy, we conduct a comparative study against two widely used approaches, namely, linear fusion and adaptive fusion.
Figure 2 presents the performance comparison of the three fusion strategies on the NYC, TKY, and Gowalla datasets.
On the NYC dataset with the highest density, all fusion methods achieve strong performance. Nevertheless, cascaded fusion consistently outperforms both linear and adaptive fusion across all evaluation metrics. This suggests that even simple fusion remains competitive in dense data, while the cascaded design offers additional advantages. For the TKY dataset with moderate density, adaptive fusion already surpasses linear fusion, reflecting the benefit of dynamically adjusting the contributions of different components. However, cascaded fusion further improves performance on all metrics, indicating that the cascaded strategy enhances robustness and more effectively leverages intermediate signals. On the sparser Gowalla dataset, the performance gaps among methods become most pronounced. Linear fusion performs the weakest, while adaptive fusion yields moderate improvements. In contrast, cascaded fusion achieves the highest scores across all evaluation metrics. These results highlight that the cascaded design is especially effective in sparse data, where higher-order collaborative signals are crucial.
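To make the three strategies concrete, the sketch below contrasts fixed-weight linear fusion, a softmax-gated adaptive fusion, and a generic sequential-refinement stand-in for cascaded fusion. These are hypothetical simplifications of the compared strategies, not the paper's exact operators; in practice the gate scores and mixing weights are learned.

```python
import numpy as np

def linear_fusion(views, w=None):
    # fixed-weight average of the per-view embeddings
    w = w or [1 / len(views)] * len(views)
    return sum(wi * v for wi, v in zip(w, views))

def adaptive_fusion(views, scores):
    # softmax gate over per-view importance scores (learned in practice)
    a = np.exp(scores - np.max(scores))
    a = a / a.sum()
    return sum(ai * v for ai, v in zip(a, views))

def cascaded_fusion(views, alpha=0.5):
    # illustrative cascade: fold views in one at a time, letting each
    # stage refine the running representation
    z = views[0]
    for v in views[1:]:
        z = (1 - alpha) * z + alpha * v
    return z

views = [np.ones(4), 2 * np.ones(4), 4 * np.ones(4)]  # toy view embeddings
```

The key structural difference is that the cascade conditions each fusion step on the intermediate result, rather than mixing all views in a single weighted sum.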
5.4. Hyperparameter Analysis
Theoretical analysis alone is insufficient to identify the optimal hyperparameter configuration. Therefore, we conduct sensitivity experiments to assess the impact of different hyperparameter settings on model performance, providing a basis for model optimization and tuning. Given the complex dependence of the MRHL model on the number of hypergraph convolution layers and the temperature in contrastive learning, this subsection focuses on the analysis of these two key hyperparameters.
5.4.1. Sensitivity to the Number of Layers
To explore the impact of the hypergraph convolutional layers, we conduct a hyperparameter study on the number of layers $L$.
Figure 3 depicts the performance trends with varying layer numbers on three datasets. As the depth increases from 1 to 3, all metrics (Recall@5/10 and NDCG@5/10) consistently improve, demonstrating the benefit of capturing higher-order dependencies through deeper propagation. However, beyond three layers, the curves either slightly decline or plateau, suggesting that excessive depth introduces redundant information and oversmoothing, which undermine discriminative capacity. Accordingly, $L = 3$ achieves a favorable balance between representational capacity and the avoidance of oversmoothing and noise accumulation; thus, we use $L = 3$ as the default depth in subsequent experiments.
5.4.2. Sensitivity to Temperature
To investigate the sensitivity of contrastive learning to the temperature $\tau$, we vary $\tau$ over a range of candidate values.
Figure 4 presents the Recall@5/10 and NDCG@5/10 curves across three datasets, illustrating the effect of $\tau$ on model performance. Overall, the curves remain relatively smooth, indicating that the model is not highly sensitive to $\tau$, but clear differences can still be observed. On TKY, performance consistently increases as $\tau$ grows from small values, reaching its peak at an intermediate temperature with the highest Recall@5/10 and NDCG@5/10 values, after which the curves gradually decline as $\tau$ further increases. A similar trend is observed on NYC, where the best results are also achieved at an intermediate $\tau$, followed by a slight decrease. On Gowalla, the curves fluctuate mildly, but the best performance is observed in the same moderate range, with larger $\tau$ values leading to degradation. These results demonstrate that a properly tuned $\tau$ is essential for effective contrastive learning: extremely small or large values tend to produce suboptimal embedding representations, whereas moderate values lead to more robust and discriminative performance.
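The role of the temperature can be seen directly in an InfoNCE-style objective, where $\tau$ rescales the similarity logits before the softmax. The sketch below uses a generic single-anchor form with made-up similarity values; it is not MRHL's exact contrastive loss.

```python
import math

def infonce(sim_pos, sim_negs, tau):
    """InfoNCE loss for one anchor: similarity to the positive view
    versus negatives, with logits scaled by the temperature tau."""
    num = math.exp(sim_pos / tau)
    den = num + sum(math.exp(s / tau) for s in sim_negs)
    return -math.log(num / den)

# the same similarities yield a sharper (lower-loss) distribution
# at small tau, and a softer one at large tau
loss_sharp = infonce(0.8, [0.3, 0.1], tau=0.1)
loss_soft = infonce(0.8, [0.3, 0.1], tau=1.0)
```

Very small $\tau$ over-penalizes hard negatives, while very large $\tau$ flattens the distribution and weakens the training signal, which is consistent with the intermediate optimum observed above.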
5.5. Sparsity Analysis
To investigate whether the proposed model can alleviate the data sparsity issue in next POI recommendation, we design two types of sparsity evaluation experiments. The first experiment is conducted at the user level. Users are divided into three groups based on the number of complete trajectories in the training set, with the top 30% considered active users, the bottom 30% considered inactive users, and the rest are normal users. Moreover, all samples belonging to the same user in the test set inherit that user’s activity category. The second experiment is sample-level, where groups are determined by the trajectory length of each test sample. Specifically, the top 30% of samples are classified as active samples, the bottom 30% as inactive samples, and the remaining as normal samples. Unlike the user-level setting, this allows samples from the same user to fall into different groups.
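The user-level grouping can be expressed as a simple percentile partition over per-user trajectory counts; a minimal sketch (helper name ours, same 30/40/30 scheme as above):

```python
def split_by_activity(counts, lo=0.3, hi=0.3):
    """Rank users by trajectory count; the top 30% are 'active', the
    bottom 30% 'inactive', and the rest 'normal'. Returns {user: group}."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    n = len(ranked)
    top, bottom = int(n * lo), int(n * hi)
    groups = {}
    for i, u in enumerate(ranked):
        if i < top:
            groups[u] = "active"
        elif i >= n - bottom:
            groups[u] = "inactive"
        else:
            groups[u] = "normal"
    return groups

counts = {f"u{i}": i for i in range(10)}   # u9 has the most trajectories
groups = split_by_activity(counts)
```

The sample-level variant applies the same percentile cut to per-sample trajectory lengths instead of per-user counts, which is why samples from one user can land in different groups.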
As shown in
Figure 5, the user-level results reveal that active users consistently achieve the best performance across all datasets, while inactive users perform the worst, highlighting the challenge of modeling long-term inactive users. Normal users are situated between the two extremes, reflecting the model’s ability to partially generalize to users with moderate interaction histories. Moreover, the performance degradation is more pronounced on the Gowalla dataset, underscoring the severe challenges posed by extreme sparsity. These results indicate that the proposed model can effectively leverage rich interaction histories, while performance degradation under extreme user-level sparsity remains a common challenge for next POI recommendation models. Nevertheless, MRHL still maintains stable relative performance across datasets, suggesting a certain degree of robustness even in highly sparse and irregular user behavior settings.
As shown in
Figure 6, the sample-level results exhibit a trend distinct from the user-level setting. Normal samples generally achieve comparable or even superior performance to active samples. For instance, in the NYC and TKY datasets, normal samples slightly outperform active ones across all metrics, suggesting that the trajectory length of individual samples plays a more direct role in model prediction. Moreover, inactive samples also demonstrate relatively stable performance in this scenario, indicating that the model can effectively capture sequential information even without rich historical records. This suggests that MRHL is less dependent on long sample trajectories and can generalize to dynamic scenarios where only short-term or partial behavioral contexts are available. Overall, these findings suggest that MRHL can effectively leverage short-term trajectory information to alleviate sample-level sparsity and support robust preference learning in dynamic scenarios.
In summary, both experiments consistently demonstrate that although the model continues to be constrained by sparsity, it nevertheless exhibits notable robustness and stability in handling sparse sequences. This advantage can be attributed to its effective exploitation of high-order association signals, such as complex inter-user similarities, which facilitate the transfer of knowledge from active to inactive users. Consequently, even users with sparse trajectories can leverage richer association signals, thus leading to improved recommendation quality under highly sparse conditions. More importantly, these sparsity analysis results provide empirical evidence that MRHL can generalize beyond idealized or densely observed settings. Instead of relying solely on dense historical interactions, the proposed framework leverages multi-view hypergraph structures and high-order relational modeling, enabling it to remain effective and robust to different trajectory patterns in complex and dynamic environments characterized by sparse, incomplete, or rapidly evolving user behaviors. It is worth noting that our sparsity analysis focuses on highly sparse scenarios, rather than strict cold-start cases involving entirely unseen users or POIs. Handling completely unseen entities would require incorporating additional side information or designing inductive learning mechanisms, which is beyond the current scope of this study. Extending the proposed framework to such inductive settings remains an important direction for future work.
5.6. Efficiency Analysis
To evaluate the computational efficiency and practical feasibility of the proposed MRHL model, we conduct experiments on the Gowalla dataset. Among the datasets utilized in our study, Gowalla features the largest scale and the highest data sparsity, representing the most challenging scenario for computational costs. This rigorous setting allows us to thoroughly assess the scalability of different models in a realistic and demanding recommendation environment. We compare MRHL with several representative baseline methods that span diverse modeling paradigms and exhibit distinct computational characteristics. Specifically, these baselines include the LSTM-based PG2Net, the spatio-temporal graph neural network STGCN, and the diffusion-based Diff-POI.
As shown in
Table 5, MRHL achieves a favorable trade-off between model complexity and efficiency. Specifically, MRHL contains 3.16 M parameters, which is significantly smaller than PG2Net (19.84 M) and comparable to STGCN (3.96 M), indicating a relatively compact model design. In terms of training efficiency, MRHL requires 42.96 min per epoch, which is notably faster than PG2Net and Diff-POI. This demonstrates that MRHL maintains highly competitive training efficiency despite the intricate modeling of multi-relational hypergraph structures. For inference, MRHL achieves a low latency of 3.26 ms. While marginally slower than the lightweight STGCN, this inference speed is highly efficient and easily satisfies the strict low-latency requirements of real-world recommendation systems, especially given MRHL’s superior capacity to capture high-order collaborative signals.
Regarding GPU Memory consumption, MRHL utilizes 9568 MB, which is higher than the baseline models. This increased memory footprint is an acceptable trade-off for constructing multi-relational hypergraphs, which naturally require storing extensive high-order incidence matrices. Nevertheless, this memory cost remains well within the capacity of mainstream commercial GPUs (e.g., 11 GB or 12 GB VRAM), ensuring that MRHL maintains a practical balance between sophisticated modeling capability and real-world deployability.
Overall, the results demonstrate that MRHL achieves a strong efficiency-performance balance under the most challenging settings. It remains computationally feasible while effectively capturing complex multi-relational dependencies, highlighting its scalability and suitability for real-world large-scale POI recommendation scenarios.
6. Discussion
Our empirical evaluations systematically demonstrate the superiority of the proposed MRHL framework. Our experimental results show that effectively disentangling multi-view behaviors plays a critical role in improving next POI recommendation. Specifically, the ablation studies confirm that our cascaded fusion strategy and cross-view contrastive learning module are fundamental drivers in capturing complementary behavioral signals and mitigating data sparsity.
Beyond the empirical performance, our findings offer important theoretical insights. They also contribute to the broader literature on graph-based recommendation systems. As discussed in
Section 2, many existing approaches, including unified graph models and homogeneous hypergraphs (e.g., [
18,
22]), are inherently limited due to their reliance on confounded user representations. Our work explicitly disentangles interaction frequency, time decay, and geographical proximity, thereby validating the hypothesis from recent multi-intent studies (e.g., [
24,
25]) that factorized modeling improves interpretability. Moreover, our methodology significantly advances this paradigm. Instead of relying on independent auxiliary signals or shallow fusion strategies [
24,
25], MRHL leverages cross-view synergistic reinforcement to address long-standing sparsity and integration issues in mobility forecasting. These findings further confirm and extend prior studies on multi-view learning, demonstrating that effective cross-view interaction is not only beneficial but essential for robust representation learning in complex recommendation scenarios.
From a practical perspective, the implications of this study extend beyond next POI recommendation to related fields. The ability to extract fine-grained user intents from mobility trajectories provides significant value for Location-Based Service (LBS) providers, allowing them to deliver more context-aware and personalized marketing strategies. Furthermore, the large-scale mobility patterns derived from our multi-relational hypergraphs can be effectively applied to smart city planning and urban computing. For instance, urban planners and transportation authorities can leverage these disentangled behavioral patterns to optimize infrastructure allocation, improve traffic management, and support the development of more sustainable urban environments.
7. Conclusions
In this paper, we propose a novel POI recommendation model named MRHL. The model leverages disentangled hypergraphs, namely, interaction frequency, time decay, and geographical proximity, to comprehensively capture the complex intrinsic correlations among POIs from multiple views. Specifically, the frequency hypergraph reflects users’ preference intensity toward different locations, the decay hypergraph characterizes the transitional relations in evolving user sequences, and the proximity hypergraph captures the influence of spatial correlations on user behaviors. By enhancing the hypergraph propagation process and designing a cascaded fusion strategy, MRHL enriches POI embeddings and integrates multi-view user–POI relations more effectively. In addition, we introduce cross-view contrastive learning to capture complementary effects and strengthen the discriminative power of representations, thereby improving robustness and generalization under sparse or noisy conditions.
In future work, we will focus on systematically uncovering the latent intentions embedded in user–POI interactions, aiming to better capture the implicit information of user decision-making in POI recommendation. By disentangling the different driving factors of user behaviors, such as functional needs, social influence, and contextual preferences, we aim to obtain a more precise and finer-grained representation of user decision-making. This direction is expected to not only improve the accuracy of next POI recommendation but also enhance the interpretability of the model, enabling clearer explanations for recommendation outcomes. We also plan to explore ways to leverage disentangled representations to enhance the system’s robustness in sparse or noisy environments, ultimately contributing to the development of more reliable and user-friendly recommendation systems. Despite the promising results, we acknowledge several limitations in our current study, which naturally pave the way for further extensions. First, while our hypergraphs effectively capture complex correlations, they inherently treat nodes within a hyperedge as an unordered set. To explicitly emphasize the strict sequential nature of trajectories, future studies could explore incorporating Temporal Position Encoding into the hypergraph nodes. Second, another structural consideration is that the current model assumes fixed relationships within the hypergraphs, which may limit its ability to handle dynamic sparsity. Incorporating an attention mechanism into hyperedge convolutions to dynamically weigh POIs under specific contexts (e.g., time of day) is a promising direction to improve intent accuracy. Lastly, we acknowledge that our current multi-relational hypergraph is constructed based on a fixed set of users under a transductive setting. In real-world scenarios with dynamic user populations, the system relies on periodic offline retraining. 
Therefore, another important direction for future work is to explore inductive hypergraph learning and dynamic graph update mechanisms. This allows the model to incrementally incorporate new users without full graph reconstruction, enabling adaptation to dynamic recommendation scenarios.