3.1. Problem Formulation
We define the user set U = {u_1, u_2, …, u_{|U|}}, the location set L = {l_1, l_2, …, l_{|L|}}, and the POI category set C = {c_1, c_2, …, c_{|C|}}. We then define the concepts used in this paper.
Definition 1 (Check-in)
. A check-in record is a tuple q = (u, l, c, t), indicating that user u visited location l ∈ L of category c ∈ C at time t.
Definition 2 (Session)
. We define all check-in records generated by a user u as S_u = (q_1, q_2, q_3, …), where q_i represents the i-th check-in record of user u. We divide the check-in records of a user within a certain period of time into a session S, and the length of each session S may vary.
Definition 3 (Next-POI Recommendation)
. We define next-POI recommendation as follows: given a user u and its historical check-in records S_u, the task is to recommend the top-K points of interest that the user is most likely to visit at the next timestamp.
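As a concrete illustration of Definitions 1 and 2, the sketch below represents check-ins as (user, location, category, time) tuples and splits a user's records into sessions by a time-gap threshold; the 6 h gap and all identifiers are illustrative assumptions, not values from the paper.

```python
from datetime import datetime, timedelta

# A check-in q = (user, location, category, timestamp); names are illustrative.
checkins = [
    ("u1", "l3", "food", datetime(2024, 5, 1, 9, 0)),
    ("u1", "l7", "work", datetime(2024, 5, 1, 9, 40)),
    ("u1", "l2", "food", datetime(2024, 5, 2, 12, 0)),  # next day -> new session
]

def split_sessions(records, gap=timedelta(hours=6)):
    """Group a user's check-ins into sessions: a new session starts
    whenever the gap to the previous check-in exceeds `gap`."""
    sessions = []
    for q in sorted(records, key=lambda q: q[3]):
        if sessions and q[3] - sessions[-1][-1][3] <= gap:
            sessions[-1].append(q)
        else:
            sessions.append([q])
    return sessions

sessions = split_sessions(checkins)
```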
3.2. Multi-Branch Attention Fusion Network
The structure of the Multi-Branch Attention Fusion Network is shown in Figure 2, where N is set to 2. We use light-colored solid shapes to represent the original embedded data, with different colors denoting different embedding modalities, and dark-colored solid shapes to represent the feature-enhanced modalities obtained after each branch captures its features. The Multi-Branch Attention Fusion Network consists of three branches (the POI Branch, Time Branch, and Category Branch) and a final Branch Fusion module. It first splits the various embedded information into separate branches, each focusing on a specific modality, and then fuses them together.
From human trajectory data we can learn a great deal about human movement patterns, but because trajectory sequences are sparse, we adopt the embedding method described in [9]. For user ID, location, and POI category information, we use word2vec [41] to map these sparse data into low-dimensional dense feature vectors, represented as E_U ∈ R^{|U|×d_U}, E_L ∈ R^{|L|×d_L}, and E_C ∈ R^{|C|×d_C}, respectively. Here, |U| denotes the number of users, |L| the number of locations, |C| the number of location categories, and d_U, d_L, and d_C the embedding dimensions of the corresponding features. For timestamp information, since it is continuous and cannot be embedded directly, we first divide the 24 h of a day into 24 time segments (0–23) for weekdays and another 24 segments (24–47) for weekends, thereby distinguishing weekdays from weekends. We then map each timestamp to its corresponding time segment and encode it, represented as E_T ∈ R^{48×d_T}, with embedding dimension d_T.
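The weekday/weekend time-segment encoding can be written as a small helper; the function name is ours, but the 0–23 weekday and 24–47 weekend slot scheme follows the text.

```python
from datetime import datetime

def time_slot(ts: datetime) -> int:
    """Map a timestamp to one of 48 slots: hour 0-23 on weekdays,
    hour + 24 (i.e., 24-47) on weekends."""
    return ts.hour + (24 if ts.weekday() >= 5 else 0)
```

For example, 9:00 on Wednesday, 1 May 2024 falls in slot 9, while 9:00 on Saturday, 4 May 2024 falls in slot 33.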
Previous studies rarely refine the embedded features any further, often simply concatenating the various embeddings. While progress has been made at the basic feature-representation level, the modal differences between multi-source heterogeneous features have been widely overlooked. Moreover, this coarse-grained mixing, which fuses embedded vectors such as geographic coordinates, timestamps, and POI categories through simple concatenation or weighted summation, leads to two issues: first, the semantic orthogonality of features across different dimensions is disrupted, so that temporal–spatial patterns and semantic information become confounded; second, the fine-grained associative patterns unique to each dimension are difficult to model in a targeted manner. To address this, we designed the Multi-Branch Attention Fusion Network module. Liu et al. [42] proposed a multi-behavioral sequential recommendation model called MAINT, which makes recommendations by extracting different preferences from target behaviors; its strong results support the feasibility of our multi-branch design.
The POI Branch is a component specifically designed to extract geospatial correlation features within the multi-branch spatiotemporal modeling network. It employs a deep attention mechanism [43,44] to explicitly model the spatial dependencies between POIs in user movement trajectories, including geographic proximity, regional functionality, and latent spatial access patterns. Its goal is to overcome the limitations of traditional sequence models in capturing local geographic context, providing fine-grained spatial semantic representations for personalized location recommendation.
The input to this module is the raw embedding of the POI sequence, denoted here as X. It first passes through two layers of multihead attention. In each layer, a multiview feature projection generates the Query, Key, and Value matrices:

Q = XW^Q,  K = XW^K,  V = XW^V

where W^Q, W^K, and W^V are learnable parameters. The attention scores are then obtained through the scaled dot-product attention operation:

head_i = Softmax(Q_i K_i^T / √d_k) V_i,  MultiHead(X) = Concat(head_1, …, head_H) W^O

where H represents the number of attention heads, d_k is the per-head dimension, and W^O is the output projection. In addition, layer normalization and residual connections are applied within the attention module:

Z = LayerNorm(X + MultiHead(X))
When people select the next POI, they often exhibit path dependency in their movement trajectories, such as having fixed commuting routes. For example, most people have a fixed route from home to the subway station and then to the office. Given this phenomenon, we use dot-product attention operations to adaptively learn the transition probabilities between POIs, suppressing low-probability unreasonable transitions and capturing regular patterns in users’ historical visit paths. We have set up two attention layers. The first-layer attention primarily focuses on adjacent POIs in the sequence, modeling local interactions to represent neighboring relationships. The higher-layer attention receives the local features from the lower layer’s output, utilizing residual connections and feature propagation. The second layer can access a broader range of POIs, expanding the receptive field to capture regional functional features.
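A minimal numpy sketch of the scaled dot-product attention used in each branch layer (single head, illustrative shapes and random weights; the full model uses H heads plus residual connections and layer normalization):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Project X into Q, K, V and attend; A acts like a learned
    transition-weight matrix over positions in the POI sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(K.shape[-1])
    A = softmax(scores)
    return A @ V, A

rng = np.random.default_rng(0)
L, d = 5, 8                       # sequence length, embedding dim (illustrative)
X = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = scaled_dot_product_attention(X, Wq, Wk, Wv)
```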
After multi-level attention refinement, the module further integrates multi-level spatial semantics through a feedforward network. The feedforward network extracts high-order spatial interaction information through an expansion–compression dimensional operation and suppresses noise; its output is added to the final attention-layer features via a residual connection to mitigate the vanishing-gradient problem in deep networks and ensure training stability, followed by another layer normalization so that the output feature scale is consistent with the remaining branches of the multi-branch network. The specific formulas are as follows:

FFN(Z) = ReLU(ZW_1 + b_1)W_2 + b_2

H = LayerNorm(Z + FFN(Z))

where Z is the output of the attention layers, W_1 and W_2 are trainable weight matrices, b_1 and b_2 are bias parameters, and the final output H has the same dimensions as the input X. For POI recommendation tasks, the POI Branch significantly enhances the consistency of path prediction and avoids unreasonable candidate targets.
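The FFN block with its expansion–compression projection, residual connection, and final layer normalization can be sketched as follows (numpy, illustrative dimensions; the 4× expansion factor is an assumption on our part):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn_block(Z, W1, b1, W2, b2):
    """Position-wise FFN: ReLU expansion then compression,
    residual add, and a final layer norm."""
    F = np.maximum(Z @ W1 + b1, 0.0) @ W2 + b2
    return layer_norm(Z + F)

rng = np.random.default_rng(1)
L, d = 5, 8
Z = rng.normal(size=(L, d))
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)  # expansion
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)      # compression
out = ffn_block(Z, W1, b1, W2, b2)
```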
The architectural design of the Time Branch and Category Branch is largely similar to that of the POI Branch, with each branch adopting a hierarchical structure of multihead attention, feature enhancement, and hierarchical fusion. In the Time Branch, the lower-level time attention focuses on local continuity, while the higher-level attention integrates global periodicity. The Category Branch reinforces the semantic association modeling of POI categories through a hierarchical feature learning mechanism: its first-layer attention focuses on fine-grained category interaction patterns, so that POIs with similar functional attributes obtain higher association weights even when their physical locations are dispersed, thereby capturing users’ consumption intentions, while its higher-level attention expands the receptive field to identify regional functional combination features. By modeling semantic associations, this branch moves beyond purely distance-based reasoning, supports cross-regional same-category recommendations and cross-category functional combination recommendations, and suppresses the redundancy of a single category dominating the recommendation results.
After capturing the features of each branch, we achieve hierarchical fusion of the multi-branch features through the BranchAttentionFusion module, which adopts a dual-pathway (static and dynamic) design. When a new POI lacks historical data, a pure attention mechanism may fail due to noise or data sparsity, so the static weights provide a base recommendation. In the static pathway, the module assigns a learnable weight parameter to each branch, generates a probabilistic weight distribution via Softmax, and computes a weighted sum of the output features of the POI, Time, and Category Branches. To capture users’ complex behavioral patterns, the dynamic pathway uses an attention mechanism to capture fine-grained semantic associations across branches. Because POI recommendation is physically grounded in spatial location and user movement is strictly constrained by distance, the dynamic pathway takes the spatial (POI) branch features as the query benchmark, projects each branch’s features into keys and values, concatenates them to form a global context, and computes attention scores through scaled dot-product attention to dynamically fuse the semantic information of all branches. Finally, the module sums the projected output of the dynamic pathway with the statically weighted features through a residual connection and applies layer normalization to eliminate feature-scale differences, yielding the fused unified representation. The specific process is as follows:

w = Softmax(α),  F_static = Σ_{i=1}^{N} w_i X_i

X_cat = Concat(X_1, …, X_N),  Q = X_1 W^Q,  K = X_cat W^K,  V = X_cat W^V

F_dyn = Softmax(QK^T / √D) V

F_out = LayerNorm(F_static + F_dyn W^O)
In the equations, N denotes the number of branches (N = 3 in this work), and X_i ∈ R^{B×L×D} represents the output tensor of the i-th branch. Here, B indicates the batch size, L the length of the historical behavior sequence, and D the unified feature dimension. The vector α denotes the learnable static branch-weight parameters. The matrices W^Q, W^K, and W^V are linear transformation matrices for generating the queries, keys, and values, respectively, and W^O projects the dynamic output. F_static represents the weighted-sum result of the static pathway, while F_dyn denotes the dynamically fused result of the attention mechanism in the dynamic pathway. F_out constitutes the ultimate output, formed by layer-normalized integration of the static and dynamic fusion results.
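A simplified numpy sketch of the dual-pathway fusion, assuming three branch outputs of shape (L, D): the static path softmax-weights the branches, and the dynamic path uses the POI branch as the query against the concatenated branches as keys/values; batch handling, the output projection, and bias terms are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def branch_fusion(branches, alpha, Wq, Wk, Wv):
    """Static path: softmax-weighted sum of branch outputs.
    Dynamic path: POI branch drives the query; all branches,
    concatenated along the sequence axis, act as keys/values."""
    w = softmax(alpha)                               # (N,) static weights
    f_static = sum(wi * Xi for wi, Xi in zip(w, branches))
    ctx = np.concatenate(branches, axis=0)           # (N*L, D) global context
    Q, K, V = branches[0] @ Wq, ctx @ Wk, ctx @ Wv
    f_dyn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    f = f_static + f_dyn                             # residual sum of pathways
    mu, var = f.mean(-1, keepdims=True), f.var(-1, keepdims=True)
    return (f - mu) / np.sqrt(var + 1e-5)            # layer norm

rng = np.random.default_rng(2)
L, D = 4, 6
branches = [rng.normal(size=(L, D)) for _ in range(3)]  # POI, Time, Category
alpha = np.zeros(3)                                     # uniform static weights
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
fused = branch_fusion(branches, alpha, Wq, Wk, Wv)
```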
3.3. Adaptive Spectral Gate
Inspired by [33], we propose an Adaptive Spectral Gate module based on frequency-domain signal processing to address the coexistence of periodic patterns and random noise in users’ mobility behavior. The structure of this module is shown in Figure 3. The colored blocks in the figure represent frequencies from low (bottom) to high (top); dark red indicates the highest weights, with the color lightening to light blue and then deepening to dark blue as the weights decrease. The module projects user behavior sequences into the frequency domain via the fast Fourier transform (FFT) and performs global spectral modulation with learnable adaptive filters. The design also provides time–frequency dual-domain collaborative modeling, effectively decoupling long-term behavioral patterns from short-term random fluctuations.
This is accomplished by first performing a real fast Fourier transform (rFFT) on the input time-domain signal along the time dimension:

X_f = rFFT(X)

The input is a time-domain user behavior sequence X ∈ R^{B×L×D}, where B is the batch size, L is the sequence length, and D is the feature dimension. The output frequency-domain complex signal X_f ∈ C^{B×F×D} (F = ⌊L/2⌋ + 1) retains only the non-redundant frequency components after conjugate symmetry to reduce computational effort, and orthogonal normalization is used to ensure energy conservation. This decomposes the user behavior sequence into frequency components, with low frequencies corresponding to long-term patterns and high frequencies to short-term fluctuations. Since high-frequency signals usually represent rapid fluctuations that deviate from the underlying trend, making the data more random and harder to interpret [45], we propose a frequency-domain adaptive filter that generates dimension-independent filters through a trainable complex weight matrix, constraining the real and imaginary parts of each dimension to the (0, 1) interval with the Sigmoid function. We define the learnable complex filter K ∈ C^{F×D}, whose real and imaginary parts for each feature dimension d are parameterized by Sigmoid-activated weights:

K_d = σ(W_{r,d}) + i·σ(W_{i,d})

where σ(·) is the Sigmoid function and W_r and W_i are the learnable real- and imaginary-part parameter matrices corresponding to feature dimension d. The filter is then applied to the frequency-domain signal by element-wise complex multiplication:

X̃_f = X_f ⊙ K

Expanding the result into its real and imaginary parts:

X̃_f = (Re(X_f) ⊙ σ(W_r) − Im(X_f) ⊙ σ(W_i)) + i·(Re(X_f) ⊙ σ(W_i) + Im(X_f) ⊙ σ(W_r))
Finally, we reconstruct the time-domain signal by the inverse transform, keeping the output sequence length consistent with the input:

X′ = irFFT(X̃_f, L)

We obtain the result X′ ∈ R^{B×L×D}. To further optimize the feature characterization, the module introduces a difference-driven gating mechanism that generates dynamic weights from the feature differences before and after spectral processing:

g = σ((X′ − X)W_g + b_g),  X_freq = g ⊙ X′
In the above equation, if g_d > 0.5, the frequency-domain enhancement is effective and the contribution of dimension d is amplified; if g_d < 0.5, the frequency-domain processing is distorted and the impact of dimension d is attenuated. This is followed by dual-domain synergy with the multihead time-domain attention module:

X_time = MultiHead(X)
The final time–frequency synergy feature is obtained through a residual connection:

Y = LayerNorm(X + X_freq + X_time)

The final result is Y ∈ R^{B×L×D}.
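The spectral path described above (rFFT along time, Sigmoid-bounded complex filter, inverse rFFT, difference-driven gate) can be sketched in numpy; parameter names are illustrative, and the time-domain attention and final residual fusion are omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_spectral_gate(X, Wr, Wi, Wg):
    """rFFT along time, learnable complex filter with Sigmoid-bounded
    real/imag parts, inverse rFFT, then a difference-driven gate.
    Returns the gated spectral feature (time attention omitted)."""
    L = X.shape[-2]
    Xf = np.fft.rfft(X, axis=-2, norm="ortho")        # (B, L//2+1, D), complex
    K = sigmoid(Wr) + 1j * sigmoid(Wi)                # per-frequency, per-dim filter
    X_filt = np.fft.irfft(Xf * K, n=L, axis=-2, norm="ortho")
    g = sigmoid((X_filt - X) @ Wg)                    # gate from pre/post difference
    return g * X_filt                                 # gated spectral feature

rng = np.random.default_rng(3)
B, L, D = 2, 8, 4
X = rng.normal(size=(B, L, D))
Wr = rng.normal(size=(L // 2 + 1, D))                 # real-part filter weights
Wi = rng.normal(size=(L // 2 + 1, D))                 # imaginary-part filter weights
Wg = rng.normal(size=(D, D))                          # gate projection
X_freq = adaptive_spectral_gate(X, Wr, Wi, Wg)
```

Note that `irfft(..., n=L)` guarantees the output sequence length matches the input, as required above.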
This module adapts through backpropagation by jointly optimizing the trainable filter parameters W_r and W_i with the rest of the network, enabling it to adaptively amplify task-relevant global spectral patterns and suppress noise-dominated spectral components. Specifically, if high-frequency components are generally associated with noise, this mechanism automatically weakens the spectral weights of the corresponding features. Overall, the module indirectly emphasizes low-frequency long-term patterns and suppresses high-frequency short-term noise in user behavior sequences by combining learnable global modulation with data-driven frequency-domain gating and time-domain multihead attention.