Article

Temporal-Aware and Intent Contrastive Learning for Sequential Recommendation

College of Electronic and Information Engineering, Anhui Jianzhu University, Hefei 230601, China
*
Author to whom correspondence should be addressed.
Symmetry 2025, 17(10), 1634; https://doi.org/10.3390/sym17101634
Submission received: 24 August 2025 / Revised: 15 September 2025 / Accepted: 20 September 2025 / Published: 2 October 2025
(This article belongs to the Section Computer)

Abstract

In recent years, research in sequential recommendation has primarily refined user intent by constructing sequence-level contrastive learning tasks through data augmentation or by extracting preference information from the latent space of user behavior sequences. However, existing methods suffer from two critical limitations. Firstly, they fail to account for how random data augmentation may introduce unreasonable item associations in contrastive learning samples, thereby perturbing sequential semantic relationships. Secondly, the neglect of temporal dependencies may prevent models from effectively distinguishing between incidental behaviors and stable intentions, ultimately impairing the learning of user intent representations. To address these limitations, we propose TCLRec, a novel temporal-aware and intent contrastive learning framework for sequential recommendation, incorporating symmetry into its architecture. During the data augmentation phase, the model employs a symmetrical contrastive learning architecture and incorporates semantic enhancement operators to integrate user preferences. By introducing user rating information into both branches of the contrastive learning framework, this approach effectively enhances the semantic relevance between positive sample pairs. Furthermore, in the intent contrastive learning phase, TCLRec adaptively attenuates noise information in the frequency domain through learnable filters, while in the pre-training phase of sequence-level contrastive learning, it introduces a temporal-aware network that utilizes additional self-supervised signals to assist the model in capturing both long-term dependencies and short-term interests from user behavior sequences. The model employs a multi-task training strategy that alternately performs intent contrastive learning and sequential recommendation tasks to jointly optimize user intent representations. 
Comprehensive experiments conducted on the Beauty, Sports, and LastFM datasets demonstrate the soundness and effectiveness of TCLRec, where the incorporation of symmetry enhances the model’s capability to represent user intentions.

1. Introduction

In contrast to conventional recommendation systems [1,2], sequential recommendation (SR) systems [3,4] take into account the ordering of items in interaction sequences. By dynamically modeling users’ historical behaviors, SR systems predict the items that users are most likely to be interested in next. Sequential recommendation systems have found extensive applications across multiple domains, particularly in e-commerce platforms, multimedia services (movies and music), and digital content delivery (news platforms). Their core objective is to extract dynamic user interest patterns from complex behavioral sequences (e.g., clicks, purchases, and browsing orders) and achieve cross-scenario personalized recommendations.
In early works, researchers relied on Markov chains (MCs) [5] or Factorization Machines [6,7] to model low-order relationships between sequential items. Subsequently, Recurrent Neural Networks (RNNs) [8] and Gated Recurrent Units (GRUs) [9] were introduced into the field of sequential recommendation to more effectively model user interactions. With the emergence of Transformer [10,11] as a prominent research focus, self-attention mechanisms have been successfully incorporated into SASRec [12] and BERT4Rec [13] to effectively capture item–item interactions. The widespread adoption of Graph Neural Networks (GNNs) has demonstrated their effectiveness, with several GNN-based approaches, such as MB-GCN [14] and KHGT [15], learning transition relationships between items through message passing over graph-structured item representations. Furthermore, FMLP-Rec [16] enhances sequential modeling efficiency by incorporating noise filtering for user behavior sequences and adopting a simplified Multilayer Perceptron (MLP) architecture, thereby effectively capturing long-range dependency patterns in sequential data. To further enhance the performance of sequential recommendation systems, researchers have begun to incorporate self-supervised learning (SSL) [17,18,19,20,21] into the recommendation paradigm. SSL is a paradigm that derives supervisory signals from unlabeled data, reducing the need for manual annotation. In sequential recommendation, SSL commonly leverages pretext tasks to learn meaningful sequence representations. Typical strategies include contrastive learning [22,23,24], which aligns semantically similar sequences while separating dissimilar ones, and data augmentation techniques that generate diverse sequence views to enhance model robustness. S3-Rec [17] incorporates contextual information into sequential recommendation through mutual information maximization, thereby further enhancing sequential representation learning. 
CL4SRec [18] implements three sequence-specific data augmentation techniques to construct positive view pairs for contrastive learning during model optimization. CoSeRec [19] introduces two novel augmentation operators that perform replacement and insertion operations by computing item correlations, aiming to mitigate data noise and alleviate the cold-start user problem. However, it still relies on random data augmentation strategies and fails to incorporate item features when computing item correlations. DuoRec [20] advances contrastive learning in sequential recommendation by introducing both supervised and unsupervised objectives at the model level. The development and application of these methods have significantly enriched the research landscape in sequential recommendation, offering novel perspectives for enhancing recommendation system performance.
As research advances, scholars have increasingly recognized that relying solely on superficial patterns in user behavior sequences may be insufficient to capture users’ latent preferences, as purchase behaviors are fundamentally governed by underlying intent. DSSRec [25] employs a sequence-to-sequence (seq2seq) training strategy to infer user intent through clustering of individual sequence representations. SINE [26] introduces a sparse interest extraction module that adaptively infers user interaction intentions from a diverse set of intent groups. With the advancement of contrastive learning techniques, researchers [27,28,29,30] have begun leveraging contrastive learning tasks to develop sequence recommendation models that incorporate user intent. ICLRec [27] extracts intent prototype representations through clustering in the embedding space of all user behavior sequences. It further constructs contrastive learning tasks to integrate the learned intents into the sequential recommendation framework. IOCRec [28] employs contrastive learning to extract multi-intent representations from randomly augmented views, with performance further enhanced through noise mitigation strategies. ELCRec [30] integrates behavior representation learning into an end-to-end clustering framework, thereby enhancing the overall performance of the model. Although incorporating user intent information enables a more comprehensive exploration of behavioral patterns and interest preferences, intent-aware contrastive learning approaches for sequential recommendation still face several critical challenges.
Current random data augmentation techniques (e.g., cropping, masking, and reordering) expand sequences through random item perturbation. However, their view generation process remains unoptimized for sequence correlation representation, which may compromise the original semantic structure and hinder models from capturing genuine user intent. Due to the long-tailed distribution of sequence lengths, shorter sequences exhibit higher sensitivity to random perturbations. Operations such as masking may exacerbate information loss in short sequences, consequently aggravating the cold-start problem. Furthermore, current contrastive learning methods typically treat user behavior sequences as static sets, overlooking their temporal characteristics. Since these sequences inherently encapsulate both short-term and long-term interests, effectively learning temporal intervals and ordering between sequences is crucial for accurate user intent modeling. However, user behavior data often contain noise and irrelevant items. Existing models primarily relying on self-attention mechanisms are especially vulnerable to such noise interference, potentially inducing overfitting and consequently impairing accurate modeling of behavioral dynamics and global patterns.
To address these challenges, we propose TCLRec, a novel temporal-aware and intent contrastive learning framework for sequential recommendation. The key contributions of this work are summarized as follows:
  • In terms of data augmentation, a semantic enhancement module based on user preferences is designed to improve existing random augmentation methods. On one hand, two semantic augmentation operators—semantic replacement and semantic insertion—are proposed. By incorporating user rating factors during augmented view construction, these operators generate enhanced sequences that better align with user preferences. On the other hand, a hyperparameter is introduced to dynamically distinguish between long and short sequences, controlling the supplemental intensity of additional self-supervised signals to the original user interaction sequence. This effectively mitigates sequence length skewness, thereby improving the confidence of positive sample pairs.
  • For sequence modeling, a model-enhanced temporal-aware network is constructed to replace the original Transformer architecture. First, a frequency filter is introduced to adaptively attenuate noisy information in the frequency domain, capturing global features and long-term dependencies in user behavior. Second, an auxiliary temporal dependency modeling encoder is introduced to mitigate the embedding collapse problem caused by using a single sequence encoder, while simultaneously capturing short-term interest dynamics in user behavior. The complementary enhancement between the two modules enables the fusion of diverse temporal dependencies from multiple perspectives, thereby strengthening the temporal perception capability for both long- and short-term user behavior patterns under global patterns.

2. Task Definition

2.1. Problem Formulation

Consider a recommendation system with a set of users $U$ and a set of items $V$. Each user $u \in U$ has an item interaction sequence $S^u = \{v_1^u, v_2^u, \dots, v_t^u, \dots, v_n^u\}$, where $n$ denotes the length of the interaction sequence of user $u$, and $v_t^u$ represents the item interacted with by user $u$ at time step $t$. If the length $n$ of a user’s historical interaction sequence exceeds a predefined maximum length $T$, only the most recent $T$ interactions are considered. If the length $n$ is less than $T$, padding items are added at the beginning of the sequence until its length reaches $T$. For each user $u$, given the sequence $S^u$, the objective is to predict the item $v_{n+1}^u$ that the user is most likely to interact with at the next time step (i.e., time step $n+1$).
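The truncate-or-pad rule above can be made concrete with a minimal sketch; the function name and the padding id `0` are our own illustrative choices, not from the paper:

```python
# Truncate-or-pad preprocessing: sequences longer than T keep the most recent
# T items; shorter ones are left-padded with a reserved padding id up to T.
def truncate_or_pad(seq, T, pad_id=0):
    if len(seq) >= T:
        return seq[-T:]          # keep the most recent T interactions
    return [pad_id] * (T - len(seq)) + seq
```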

2.2. Sequential Recommendation Task

The core objective of sequential recommendation is to predict a user’s next potential interaction item by learning relationships among items in the given user–item interaction sequence through an encoder. The sequence encoder $f_\theta(\cdot)$ encodes a user’s historical behavior sequence $S^u$ into interest representations $H^u = f_\theta(S^u)$, where $h_t^u \in H^u$ denotes the user’s interest representation at time step $t$. Our goal is to identify optimal encoder parameters $\theta^*$ by maximizing the log-likelihood of users’ future behaviors, as formulated in Equations (1)–(3).
$$\theta^* = \arg\max_{\theta} \sum_{u=1}^{N} \sum_{t=1}^{T} \ln P_\theta(v_t^u) \tag{1}$$
$$\mathcal{L}_{NextItem} = \sum_{u=1}^{N} \sum_{t=1}^{T} \mathcal{L}_{NextItem}^{u,t} \tag{2}$$
$$\mathcal{L}_{NextItem}^{u,t} = -\log\big(\sigma(h_{t-1}^u \cdot v_t^u)\big) - \sum_{neg} \log\big(1 - \sigma(h_{t-1}^u \cdot v_{neg}^u)\big) \tag{3}$$
where $N$ denotes the batch size of the sequence model input, $v_t^u$ represents the embedding of the target item, $v_{neg}^u$ denotes the embedding of items not interacted with by user $u$, $\sigma$ is the sigmoid activation function, and $P_\theta(v_t^u)$ represents the conditional probability that user $u$ interacts with an item at time $t$ given model parameters $\theta$.
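The per-step objective in Equation (3) is a binary cross-entropy over the positive item and sampled negatives; a minimal sketch in plain Python, with all names illustrative rather than the authors’ code:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Per-step next-item loss of Eq. (3): h_prev is the interest vector h_{t-1}^u,
# pos is the target item embedding, negs are sampled negative item embeddings.
def next_item_loss(h_prev, pos, negs):
    loss = -math.log(sigmoid(dot(h_prev, pos)))
    for neg in negs:
        loss -= math.log(1.0 - sigmoid(dot(h_prev, neg)))
    return loss
```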

3. The Proposed Model

We propose TCLRec, a temporal-aware and intent contrastive learning model for sequential recommendation, which mainly consists of three components: a semantic enhancement module, a temporal-aware network, and intent contrastive learning. The semantic enhancement module adopts a differentiated data augmentation strategy that incorporates user rating information, designing optimal augmentation methods tailored for both long and short sequences to comprehensively learn semantic information from user behavior sequences. The proposed approach then employs learnable filters to eliminate noise interference and extract global temporal features. During the pre-training phase, we integrate frequency-domain filtering algorithms with temporal collaborative signals to construct the temporal-aware network, which further captures short-term dependencies in sequences and the dynamic evolution patterns of user behaviors. The model captures user intentions from all user behavior sequences using the K-means clustering algorithm. Within a generalized Expectation-Maximization (EM) framework, the optimization steps of intention representation learning and intention contrastive learning are performed alternately, thereby maximizing the consistency between sequence features and their corresponding intentions to guide the prediction of the next item. The overall architecture of TCLRec is illustrated in Figure 1.

3.1. Semantic Enhancement Module

The semantic enhancement module focuses on preserving the semantic integrity of user behavior sequences. Its core lies in integrating explicit user preferences with implicit behavioral patterns. Through a joint dynamic augmentation strategy and contrastive learning task, the vector representations of the initial enhanced user semantic sequences serve as contrastive targets for sequence SSL tasks, thus better capturing the semantic feature information of user interaction sequences. First, we systematically analyze conventional random data augmentation methods (including cropping, masking, and reordering) and their limitations for sequence modeling. Second, we propose two semantic augmentation operators that incorporate user rating information: semantic substitution and semantic insertion. Finally, by introducing a sequence-length-adaptive differential augmentation strategy, we achieve enhanced representation of latent semantic information in sequences.

3.1.1. Random Augmentation Operators

This section examines the implementation of three conventional random augmentation operators: Crop (C), Mask (M), and Reorder (R).
C: Starting from position i, a continuous subsequence of length l is randomly selected and extracted as formulated in Equation (4).
$$S^u_C = C(S^u) = [v_i, v_{i+1}, \dots, v_{i+l-1}] \tag{4}$$
where $i \in \{1, 2, \dots, n-l+1\}$, $S^u_C$ denotes the cropped subsequence, and $v_{i+l-1}$ represents the last item in the cropped subsequence.
M: Random items in the sequence are masked and replaced with a special token [mask] to simulate missing data. The masked sequence S u M is computed as shown in Equation (5).
$$S^u_M = M(S^u) = [v_1, v_2, \dots, v'_i, \dots, v_n] \tag{5}$$
where $v'_i$ denotes the transformed item representation. If $v_i$ is selected for masking, then $v'_i = [\mathrm{mask}]$; otherwise, $v'_i = v_i$.
R: Starting from position $i$, a subsequence of length $r$, $(v_i, \dots, v_{i+r-1})$, is randomly selected, and its items are shuffled to produce the reordered sequence $S^u_R$ as given in Equation (6).
$$S^u_R = R(S^u) = [v_1, v_2, \dots, v'_i, \dots, v'_{i+r-1}, \dots, v_n] \tag{6}$$
where $i \in \{1, 2, \dots, n-r+1\}$ and $(v'_i, \dots, v'_{i+r-1})$ represents the shuffled subsequence.
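The three random operators of Equations (4)–(6) can be sketched as follows; the `MASK` sentinel and the mask ratio are our own illustrative choices:

```python
import random

MASK = -1  # illustrative reserved token id for masked positions

def crop(seq, l, rng):
    # Eq. (4): extract a contiguous subsequence of length l at a random start.
    i = rng.randrange(0, len(seq) - l + 1)
    return seq[i:i + l]

def mask(seq, ratio, rng):
    # Eq. (5): replace a random subset of positions with the [mask] token.
    idx = set(rng.sample(range(len(seq)), int(len(seq) * ratio)))
    return [MASK if t in idx else v for t, v in enumerate(seq)]

def reorder(seq, r, rng):
    # Eq. (6): shuffle a random contiguous subsequence of length r in place.
    i = rng.randrange(0, len(seq) - r + 1)
    sub = seq[i:i + r]
    rng.shuffle(sub)
    return seq[:i] + sub + seq[i + r:]
```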

3.1.2. Semantic Augmentation Operators

Although conventional augmentation operations, such as cropping, masking, and reordering, have been widely adopted, these methods may compromise the semantic integrity of sequences when generating augmented samples, particularly for short sequences [31]. We propose a semantic augmentation approach that systematically integrates user preferences through joint modeling of explicit ratings and implicit collaborative relationships to derive item-wise semantic representations. This generates more reliable and semantically enriched augmented samples while effectively preserving the core semantic features of sequences. Below, we elaborate on two semantic augmentation operators: semantic replacement and semantic insertion.
The substitution operation replaces items in the sequence with highly correlated counterparts, which minimally disrupts the original sequence information and generates more reliable positive sample pairs with higher confidence. Current algorithms for selecting substitution items, such as collaborative filtering [32], simplistically assume that two items are correlated if they share many common users, thereby neglecting users’ genuine semantic preferences. To address this limitation, we propose a semantic substitution method that integrates rating information with collaborative filtering principles. Figure 2 illustrates the semantic substitution process. Given a user–item rating matrix, we first compute the similarity between the sets of users who interacted with each pair of items using the Jaccard similarity coefficient. We then calculate the correlation of common users’ ratings between items using the Pearson correlation coefficient, which serves as a weighting factor. Additionally, we account for the influence of the number of common users between popular items on similarity. Finally, the item relevance is computed as specified in Equation (7).
$$\mathrm{Cor}(i,j) = \big|\rho(R_i^u, R_j^u)\big| \cdot J(i,j) \cdot \sum_{u} \frac{1}{\log\big(1 + |N(u)|\big)} \tag{7}$$
where $u$ is a user, $\mathrm{Cor}(i,j)$ denotes the relevance between items $i$ and $j$, $\rho(R_i^u, R_j^u)$ represents the correlation between the ratings given to items $i$ and $j$ by their common users, $|N(u)|$ denotes the number of common users between items $i$ and $j$, and $J(i,j)$ indicates the similarity of the sets of users who interacted with items $i$ and $j$. The term $\sum_u \frac{1}{\log(1+|N(u)|)}$ serves as a dynamically adjusted weighting factor based on the number of common users, which mitigates the excessive influence of items with numerous common users on similarity computation and prevents popular items from dominating the similarity calculation.
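A toy sketch of Equation (7), assuming `ratings` maps each item to a dict of its users’ ratings and `user_degree` gives each common user’s interaction count for the popularity-damping term; this reading of the $N(u)$ weighting is our interpretation, and all names are hypothetical:

```python
import math

def pearson(xs, ys):
    # Pearson correlation of two equal-length rating lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def relevance(ratings, user_degree, i, j):
    # Eq. (7) sketch: |Pearson| * Jaccard * popularity-damping sum.
    ui, uj = set(ratings[i]), set(ratings[j])
    common = ui & uj
    if not common:
        return 0.0
    jaccard = len(common) / len(ui | uj)
    rho = pearson([ratings[i][u] for u in common],
                  [ratings[j][u] for u in common])
    damp = sum(1.0 / math.log(1.0 + user_degree[u]) for u in common)
    return abs(rho) * jaccard * damp
```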
Based on the above calculation results, we first select indices and randomly choose certain items in the sequence for replacement. Then, using the user-preference-integrated item relevance computation method, we identify a correlated item to substitute at each selected position, thereby obtaining the semantically substituted sequence, as shown in Equation (8).
$$S^u_{SS} = SS(S^u) = [v_1, v_2, \dots, v'_i, \dots, v_n] \tag{8}$$
The semantic insertion operation enhances the model’s understanding of sequential semantic information by inserting highly correlated items into the sequence to construct augmented data. Figure 3 illustrates the semantic insertion process. Specifically, a randomly selected item in the sequence is augmented by inserting its correlated item $v'_i$ at the target index position, yielding the semantically inserted sequence, as formalized in Equation (9).
$$S^u_{SI} = SI(S^u) = [v_1, v_2, \dots, v_i, v'_i, \dots, v_n] \tag{9}$$

3.1.3. Semantic Enhancement Strategy

Considering the length characteristics of user sequences, we categorize sequences into short and long sequences using a hyperparameter L as the threshold. Different combinations of augmentation methods are then applied to short and long sequences, with the selection approach mathematically formulated in Equation (10).
$$S^u_a = a(S^u), \quad a \in \begin{cases} \{SS, SI, C\}, & n \le L \\ \{SS, SI, M, R, C\}, & n > L \end{cases} \tag{10}$$
For short sequences, we adopt a conservative semantic augmentation set { S S , S I , C}, where S S and S I effectively increase sample diversity, while C serves as a weak-augmentation strategy. When applied appropriately, C facilitates the model’s capture of local features in user preferences while preventing information loss from excessive perturbation. For long sequences, we employ an information-rich set of augmentation operations { S S , S I , M, R, C}, including masking and reordering techniques, designed to enhance the model’s adaptability and generalization capability for nonlinear temporal transformations in complex behavioral sequences from multiple perspectives. Through a length-adaptive augmentation strategy, the model adjusts the intensity of semantic enhancement according to the properties of different input sequences.
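The selection rule of Equation (10) reduces to choosing an operator pool by sequence length; a minimal sketch with illustrative operator labels standing in for the augmentation functions defined earlier:

```python
import random

# Eq. (10): short sequences (n <= L) draw from the conservative operator set,
# long sequences from the full set including mask and reorder.
SHORT_OPS = ["SS", "SI", "C"]
LONG_OPS = ["SS", "SI", "M", "R", "C"]

def pick_augmentation(seq, L, rng):
    ops = SHORT_OPS if len(seq) <= L else LONG_OPS
    return rng.choice(ops)
```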

3.1.4. Contrastive Learning

We employ contrastive learning to enhance the representation of semantic information by maximizing mutual information between different sequences [33]. Given a batch of sequences $\{S^u\}_{u=1}^{N}$, where $u \in \{1, 2, \dots, N\}$, we apply two distinct augmentation operations to each sequence $S^u$, as mathematically formulated in Equation (11).
$$[\tilde{S}_1, \tilde{S}_2, \dots, \tilde{S}_{2u-1}, \tilde{S}_{2u}, \dots, \tilde{S}_{2N-1}, \tilde{S}_{2N}] \tag{11}$$
where $(\tilde{S}_{2u-1}, \tilde{S}_{2u})$ are treated as positive sample pairs, while all other pairs are considered negative samples. For each pair of sequence samples, their encoded representations $(\tilde{h}_{2u-1}, \tilde{h}_{2u})$ are obtained, and the NT-Xent function is employed to formulate the sequence SSL loss function $\mathcal{L}_{SeqCL}$, as defined in Equation (12).
$$\mathcal{L}_{SeqCL}(\tilde{h}_{2u-1}, \tilde{h}_{2u}) = -\log \frac{\exp\big(\mathrm{sim}(\tilde{h}_{2u-1}, \tilde{h}_{2u})\big)}{\sum_{m=1}^{2N} \mathbb{1}_{[m \neq 2u-1]} \exp\big(\mathrm{sim}(\tilde{h}_{2u-1}, \tilde{h}_m)\big)} \tag{12}$$
where $\mathrm{sim}(\cdot)$ denotes the similarity measurement function, $(\tilde{h}_{2u-1}, \tilde{h}_{2u})$ represents the positive sample pair, the numerator $\exp(\mathrm{sim}(\tilde{h}_{2u-1}, \tilde{h}_{2u}))$ maximizes the consistency of positive pairs, and $(\tilde{h}_{2u-1}, \tilde{h}_m)$ ranges over all other sample pairs.
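A self-contained sketch of the NT-Xent loss in Equation (12), using cosine similarity with a temperature `tau`; the temperature is a standard ingredient we assume here rather than a detail given in the text:

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Eq. (12) sketch: loss for the anchor view at index a with positive at index p,
# over a batch `views` of 2N encoded augmented views.
def nt_xent(views, a, p, tau=1.0):
    num = math.exp(cosine(views[a], views[p]) / tau)
    den = sum(math.exp(cosine(views[a], views[m]) / tau)
              for m in range(len(views)) if m != a)
    return -math.log(num / den)
```

Note that a more similar positive pair yields a lower loss, which is the behavior Equation (12) optimizes for.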

3.2. Temporal-Aware Network

Within the SSL framework, the incorporation of model enhancement mechanisms enables effective integration of the temporal correlation features extracted by multi-view encoders. However, direct adoption of a dual-encoder architecture may introduce asymmetry issues in the contrastive learning process, thereby increasing the difficulty of model optimization [34]. To effectively mitigate potential representation collapse in SSL, we design a temporal-aware network architecture incorporating auxiliary encoders. Its core components are a frequency-domain filtering-based MLP for global feature extraction, which captures low-frequency information to better model global characteristics and long-term dependencies in user behaviors, and an RNN-based auxiliary encoder, applied exclusively to one branch of the intent SSL paradigm, which captures both short-term and long-term temporal dependencies in sequences. The complete architecture of this temporal-aware network is shown in Figure 4.

3.2.1. Global Feature Extraction

The global feature extraction adopts a hierarchical processing design philosophy that utilizes frequency-domain filters to capture sequence-wide characteristics, where the filter weights are automatically adjusted to adapt to data features. Through discrete Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT) operations applied to the signals, high-frequency noise components are effectively eliminated, thereby ensuring the model’s enhanced capability to learn useful information during training.
Given the input representation matrix F l at layer l, we perform 1D FFT on F l to convert it from the time domain to the frequency domain, extracting global features of the sequence and capturing its low-frequency components, which typically reflect global patterns of the sequence (e.g., trends and periodicity), as mathematically formulated in Equation (13).
$$X^l = \mathcal{F}(F^l) \in \mathbb{C}^{n \times d} \tag{13}$$
where $\mathcal{F}(\cdot)$ denotes the 1D FFT operation, $X^l$ represents the frequency-domain spectrum of the input $F^l$, $n$ denotes the cardinality of the input sample set, $d$ specifies the feature dimension of the inputs, and $\mathbb{C}$ signifies that the matrix elements are complex numbers. In the frequency domain, a learnable filter $W \in \mathbb{C}^{n \times d}$ is applied to the frequency-domain representation $X^l$ to extract global sequence features while attenuating noise interference, as mathematically formulated in Equation (14).
$$X^l_e = W \odot X^l \tag{14}$$
where $\odot$ denotes element-wise multiplication (the Hadamard product), $X^l_e$ represents the filtered frequency-domain representation, and the filter $W$ is optimized via stochastic gradient descent (SGD) to adaptively function as a frequency-domain filter. The modulated frequency-domain representation $X^l_e$ is then converted back to the time domain through the IFFT, yielding the enhanced input representation $F^l_e$. The calculation is shown in Equation (15).
$$F^l_e = \mathcal{F}^{-1}(X^l_e) \in \mathbb{R}^{n \times d} \tag{15}$$
where $\mathcal{F}^{-1}(\cdot)$ denotes the 1D IFFT operation that transforms complex-valued tensors back to real-valued tensors. To mitigate vanishing gradient issues and enhance training stability, we incorporate residual connections and layer normalization by summing the input $F^l$ with the filtered output $F^l_e$, followed by layer normalization and dropout operations, ultimately yielding sequence representations enriched with global features, as shown in Equation (16).
$$\tilde{F}^l = \mathrm{LayerNorm}\big(F^l + \mathrm{Dropout}(F^l_e)\big) \tag{16}$$
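The FFT–filter–IFFT pipeline of Equations (13)–(15) can be sketched with NumPy’s FFT routines; here `W` is a fixed example filter rather than one learned by SGD, and the residual/LayerNorm step of Equation (16) is omitted:

```python
import numpy as np

# Eqs. (13)-(15) sketch: FFT along the sequence axis, element-wise product with
# a filter W, inverse FFT back to the time domain. In the paper W is learnable;
# here it is supplied by the caller as a fixed array.
def frequency_filter(F_l, W):
    X = np.fft.fft(F_l, axis=0)                  # time -> frequency, Eq. (13)
    X_filtered = W * X                           # Hadamard product, Eq. (14)
    return np.fft.ifft(X_filtered, axis=0).real  # back to time domain, Eq. (15)
```

With an all-ones filter the transform round-trips the input unchanged, while a filter that keeps only the DC component collapses the sequence to its mean, illustrating the low-pass behavior described above.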
To intuitively demonstrate how the frequency-domain filter attenuates noise in user behavior sequences, we visually distinguish between stable user preferences and irregular fluctuations, as shown in Figure 5.
As shown in Figure 5a, the original sequence in the time domain contains not only stable preference patterns but also irregular fluctuations. Such fluctuations (e.g., sudden drops) often stem from accidental clicks or one-time browsing behaviors and carry little predictive value. By applying the FFT, the sequence is decomposed into frequency components, as depicted in Figure 5b. Concentrated peaks in the low-frequency region correspond to stable, long-term user interests, whereas the scattered weak peaks beyond the red, dashed line represent noisy information. Figure 5c illustrates the learned frequency-domain filter, which exhibits the characteristics of a low-pass filter: weights in the low-frequency region remain close to one, thereby preserving stable preferences, while weights gradually decay toward zero in the mid-to-high frequency region, effectively attenuating noise. The filtering effect is presented in Figure 5d, where the red curve smooths abrupt drops and irregular fluctuations in the original sequence, making long-term trends more salient while retaining the core preference patterns. This visualization highlights the operational mechanism of the filter and demonstrates its role in enhancing both the robustness and interpretability of global feature extraction.

3.2.2. Temporal Dependency Modeling

The temporal dependency modeling introduces an RNN-based temporal encoder that captures sequential dependencies through recurrent connections. Capitalizing on the inherent capability of RNNs to effectively model long-term dependencies in sequences, we employ a pre-training strategy where this auxiliary encoder’s parameters are fixed after completing pre-training tasks. In subsequent phases, only the global feature extraction module is utilized to ensure the stability of temporal characteristics.
We train the temporal encoder by incorporating the next-item prediction task for sequential recommendation. Given an original sequence $S^u$ for user $u \in \{1, 2, \dots, N\}$, we generate two augmented views, $\tilde{S}_{2u-1}$ and $\tilde{S}_{2u}$, through data augmentation. During pre-training, the temporal encoder produces embeddings for the alternative view, where the augmented sequence $\tilde{S}_{2u-1}$ (and similarly $\tilde{S}_{2u}$) is separately fed into both the sequence encoder $f_\theta(\cdot)$ and the temporal encoder $g_\varphi(\cdot)$ to generate corresponding embeddings. These embeddings are then combined through weighted fusion to produce the enhanced representation $z_{2u-1}$, as denoted in Equation (17).
$$z_{2u-1} = z^G_{2u-1} + \gamma \cdot z^T_{2u-1} \tag{17}$$
where $z^G_{2u-1}$ denotes the embedding generated by the sequence encoder $f_\theta(\cdot)$, $z^T_{2u-1}$ represents the embedding produced by the temporal encoder $g_\varphi(\cdot)$, and the hyperparameter $\gamma$ controls the scaling weight in the encoder fusion process, with smaller $\gamma$ values introducing less perturbation from the auxiliary encoder. Finally, the merged embedding $z_{2u-1}$ undergoes layer normalization before being fed into a feed-forward neural network (FFN) to generate the final output representation $\tilde{h}_{2u-1}$.
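The fusion step of Equation (17) is a simple weighted sum of the two encoders’ outputs; a one-function sketch with illustrative names:

```python
# Eq. (17): fuse the sequence-encoder embedding z_global with the
# temporal-encoder embedding z_temporal, scaled by the hyperparameter gamma.
def fuse(z_global, z_temporal, gamma):
    return [g + gamma * t for g, t in zip(z_global, z_temporal)]
```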

3.3. Intent Contrastive Learning

The intent contrastive learning task follows the Expectation-Maximization (EM) algorithm to jointly optimize user intent representations and contrastive learning objectives. The EM algorithm begins with the E-step, which estimates the expected values of the latent variables $c_i$ starting from an initial guess of the sequence encoder parameters $\theta$, followed by the M-step, which maximizes over the parameters $\theta$ given the values of the variables $c_i$. In the E-step, we assume there exist $K$ latent intent prototypes $\{c_i\}_{i=1}^{K}$ in the sequences, where these intents influence users’ interaction decisions with items, and we optimize the objective function in Equations (18) and (19).
$$\theta^* = \arg\max_{\theta} \sum_{u=1}^{N} \sum_{t=1}^{T} \ln \mathbb{E}_{c \sim Q(c)}\big[P_\theta(v_t^u, c)\big] \tag{18}$$
$$\mathcal{L}_i = \sum_{u=1}^{N} \sum_{t=1}^{T} \sum_{i=1}^{K} Q(c_i) \ln P_\theta(v_t^u, c_i) \tag{19}$$
where $Q(c_i)$ denotes the distribution function of $c_i$, and $\mathcal{L}_i$ represents a lower bound of the objective function. Following the EM algorithm for model optimization, we perform clustering on the vector representations $h^u$ of the user sequences $S^u$ using the K-means algorithm, partitioning users into $K$ clusters. The centroid of each cluster corresponds to the intent vector $c_i$, which serves as the intent representation of the $i$-th cluster. The distribution function $Q(c_i)$ for intent representation learning takes two possible values: $Q(c_i) = 1$ if $S^u$ belongs to cluster $c_i$, and $Q(c_i) = 0$ otherwise.
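The E-step clustering can be illustrated with a tiny K-means over sequence representations, whose centroids play the role of the intent prototypes $c_i$; a real system would use an optimized library (e.g., faiss or scikit-learn), and the initialization here is deliberately naive:

```python
# Toy K-means: cluster sequence representations and return the K centroids,
# which stand in for the intent prototypes c_i described in the text.
def kmeans(points, K, iters=10):
    centroids = points[:K]  # naive initialization: first K points
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            i = min(range(K), key=lambda k: sum((a - b) ** 2
                    for a, b in zip(p, centroids[k])))
            clusters[i].append(p)
        # recompute centroids as cluster means; keep old centroid if empty
        centroids = [
            [sum(c) / len(cl) for c in zip(*cl)] if cl else centroids[k]
            for k, cl in enumerate(clusters)
        ]
    return centroids
```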
During the M-step, to mitigate the potential issue of false negative samples in the loss function construction—wherein identical intents from distinct users may be erroneously treated as negative samples—we synthesize both data augmentation and model augmentation to generate the positive view representations $\tilde{h}_{2u-1}$ and $\tilde{h}_{2u}$, subsequently formulating a temporal contrastive loss function $\mathcal{L}_{TCL}$ for parameter optimization, which is formally defined by Equations (20) and (21).
$$\mathcal{L}_{TCL} = \mathcal{L}_{TCL}(\tilde{h}_{2u-1}, c_u) + \mathcal{L}_{TCL}(\tilde{h}_{2u}, c_u) \tag{20}$$
$$\mathcal{L}_{TCL}(\tilde{h}_{2u-1}, c_u) = -\log \frac{\exp\big(\mathrm{sim}(\tilde{h}_{2u-1}, c_u)\big)}{\sum_{v=1}^{N} \mathbb{1}_{[v \notin F]} \exp\big(\mathrm{sim}(\tilde{h}_{2u-1}, c_v)\big)} \tag{21}$$
where $F$ denotes the set of users sharing identical intents, and $c_u$ represents the intent vector assigned to user $u$. During the iterative execution of the E- and M-steps, both the intent distribution $Q(c)$ and the model parameters $\theta$ are continuously updated.

3.4. Multi-Task Training

The multi-task training strategy is employed to jointly optimize the sequential recommendation model by sharing parameters of the sequence encoder, simultaneously optimizing the following tasks: the sequential recommendation prediction task, the intent SSL task, and the sequence SSL task. This joint optimization framework is formally expressed in Equation (22).
$$\mathcal{L} = \mathcal{L}_{NextItem} + \lambda \mathcal{L}_{TCL} + \mu \mathcal{L}_{SeqCL}$$
where the parameters $\lambda$ and $\mu$ denote the weights of the intent SSL task and the sequence SSL task, respectively. The sequential recommendation prediction task focuses on local patterns in the sequence, the intent SSL task on the user's latent intent, and the sequence SSL task on the global structure of the sequence.
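Equation (22) amounts to a weighted sum of the three task losses over a shared encoder; a trivial sketch follows, where the default weight values are illustrative, not the paper's tuned settings:

```python
def joint_loss(l_next, l_tcl, l_seqcl, lam=0.5, mu=0.1):
    """Weighted multi-task objective of Eq. (22): next-item prediction
    loss plus the intent SSL and sequence SSL losses, scaled by their
    respective (illustrative) weights lam and mu."""
    return l_next + lam * l_tcl + mu * l_seqcl
```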

4. Experiments

This section first introduces the selected datasets, evaluation metrics, and baseline models, followed by comprehensive comparisons between TCLRec and various benchmark models of different types, along with ablation studies. Finally, we investigate the optimal data augmentation strategies for the model and the impact of hyperparameters on performance, and we conduct robustness experiments on key modules.

4.1. Datasets and Evaluation Metrics

This study conducts experiments on three publicly available datasets: the Beauty and Sports subsets from the Amazon Reviews dataset [35], and the LastFM dataset, which is widely adopted in music recommendation and music information retrieval. The Beauty and Sports datasets primarily consist of e-commerce product purchase behaviors, with user interactions mainly represented by purchase records. The LastFM dataset records each user's favorite artists along with per-artist play counts; from these listening histories we extract user–item interaction records. The experimental datasets are preprocessed by labeling interaction records containing numerical ratings or textual reviews as positive samples, while all other interactions are treated as negative samples. This study exclusively utilizes the '5-core' dataset, where each user has purchased at least five items and each item has been purchased by at least five users. The statistical characteristics of the datasets are presented in Table 1.
We employ two widely used metrics, HR@k and NDCG@k, for evaluation. The parameter k (number of recommended items considered) is set to k { 5 , 10 , 20 } in our experiments. HR@k measures whether the user’s actual interacted-with items appear in the top-k recommendations, reflecting recommendation accuracy. NDCG@k further considers the ranking positions of the hit items, evaluating the ranking quality of the recommendation list. Let U denote the set of users and g u represent the rank of the actual interacted item for user u in the recommendation list. Then, HR@k and NDCG@k can be defined as
$$\mathrm{HR@}k = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(g_u \le k)$$
$$\mathrm{NDCG@}k = \frac{1}{|U|} \sum_{u \in U} \frac{\mathbb{I}(g_u \le k)}{\log_2(g_u + 1)}$$
where I ( · ) is an indicator function that returns 1 if the condition is true and 0 otherwise. These metrics are computed for each user in the test set, and the average over all users is reported as the overall performance.
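The two definitions above can be computed directly from each user's ground-truth rank; a small numpy sketch, assuming 1-based ranks $g_u$:

```python
import numpy as np

def hr_ndcg_at_k(ranks, k):
    """HR@k and NDCG@k from the definitions above, given each test
    user's 1-based rank g_u of the ground-truth item. Ranks > k count
    as misses (indicator 0) for both metrics."""
    ranks = np.asarray(ranks, dtype=float)
    hit = ranks <= k
    hr = hit.mean()
    ndcg = np.where(hit, 1.0 / np.log2(ranks + 1.0), 0.0).mean()
    return hr, ndcg
```

For example, ranks of 1, 3, and 11 at k = 10 give HR@10 = 2/3, and NDCG@10 = (1 + 0.5 + 0)/3 = 0.5, since the rank-1 hit contributes a full gain while the rank-3 hit is discounted by log2(4).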

4.2. Baseline Methods

Four groups of baseline methods are included for comparison.
  • Non-sequential models: BPR-MF [36] employs matrix factorization to represent users and items as low-dimensional vectors, optimizing these vectors to maximally capture preference relationships between users and items.
  • Standard sequential models: GRU4Rec [9] applies gated recurrent units, a prevalent RNN variant designed to mitigate the vanishing-gradient problem of traditional RNNs on long sequences, to session-based recommendation. SASRec [12] employs the self-attention mechanism from Transformer models as its core approach to effectively capture sequential dependencies in users' historical behaviors.
  • Sequential models with additional SSL: BERT4Rec [13] adopts a bidirectional self-attention mechanism by reformulating the next-item prediction task as a Cloze task, enabling the model to simultaneously capture relationships between items and their contexts. CL4SRec [18] integrates contrastive SSL with Transformer-based sequential recommendation models, enhancing the model’s generalization capability through augmentation operators such as cropping, masking, and reordering.
  • Sequential models considering latent intents: DSSRec [25] employs a seq2seq architecture to process user behavior sequences while performing optimization in latent space. ICLRec [27] utilizes contrastive learning to capture latent intents in user sequences, introducing intent variables into the sequential recommendation model through clustering. IOCRec [28] applies contrastive learning by selecting users' primary intents for denoising, thereby creating high-quality intent views. ELCRec [30] integrates behavior representation learning into an end-to-end clustering framework, thereby enhancing the overall performance of the model.

4.3. Implementation Details

Our experiments were carried out on a server running Ubuntu 22.04 with an Intel Xeon Platinum 8255C CPU and an NVIDIA RTX 2080 Ti GPU. The implementation was based on Python 3.10 and PyTorch 2.1.0 with CUDA 12.1 support. The experiments employed the same parameter settings for all models. For all self-attention-based models, the number of self-attention blocks and attention heads is set to 2, and the embedding dimension is set to 64. The learning rate is set to 0.001 and the batch size to 256. The model is optimized using the Adam optimizer, which adaptively adjusts learning rates for individual parameters to accelerate convergence and enhance training efficacy.

4.4. Overall Performance Comparison

To validate the effectiveness of the model, comparative experiments are conducted on three datasets. The best results in this paper are highlighted in bold, while the second-best results are underlined. The experimental results are shown in Table 2.
The experimental results demonstrate that TCLRec outperforms all baseline methods across all metrics on three public datasets. Compared to the second-best results, the HR metric shows average improvements of 3.79% and 9.20% on the Beauty and Sports datasets, respectively, while the NDCG metric achieves average improvements of 10.27% and 12.34% on these datasets. Compared with the second-best results on the LastFM dataset, the HR and NDCG metrics show average improvements of 5.19% and 7.29%, respectively. These results validate the effectiveness of the sequence recommendation method based on temporal awareness and intent contrastive learning.
The non-sequential recommendation model BPR demonstrates suboptimal recommendation performance. BPR employs traditional collaborative filtering that focuses on static preference modeling, whereas sequential recommendation models (e.g., RNN-based or Transformer-based models) can explicitly model temporal ordering and contextual dependencies in user behaviors. This indicates that capturing and learning sequential patterns is crucial for sequential recommendation tasks.
SSL-based methods demonstrate significant performance advantages over conventional sequential recommendation approaches, underscoring the critical role of the data augmentation operations examined in this study. By constructing rich self-supervised signals, the model can more effectively capture the complex nonlinear relationships between user behaviors. SSL methods can also leverage unlabeled data to extract semantic features, thereby enhancing the expressive capacity of both user and item representations. Among SSL-based methods, CL4SRec demonstrates superior performance compared to BERT4Rec. This is because BERT4Rec relies on the implicit learning of masked language models, where its random masking strategy may disrupt the temporal continuity of sequences. In contrast, CL4SRec explicitly models semantic relationships among items within sequences by constructing positive and negative sample pairs, while its contrastive learning framework more effectively captures local dependencies in sequences. TCLRec outperforms both models because the incorporation of latent intent information enables better learning of users’ periodic interest patterns, thereby enhancing recommendation performance in personalized scenarios. Moreover, the adopted multi-task joint learning strategy more effectively integrates the two contrastive learning tasks with the recommendation task, further capturing semantic feature information from user interaction sequences.
TCLRec demonstrates significant improvements in recommendation performance over existing sequential recommendation models that consider latent user intents. This performance gain can be attributed to the limitations of methods such as ICLRec and ELCRec, which introduce noise through random data augmentation, thereby disrupting the integrity of contextual semantic information in user sequences. In contrast, the proposed model incorporates a semantic enhancement mechanism during the view construction stage. By maximizing the similarity between the original sequence and its semantically augmented view in the representation space, the augmented view preserves the semantic consistency of user behaviors. Furthermore, the integration of a frequency-domain filtering-based temporal-aware network assists in preceding item generation and enhances the model’s capacity to represent global sequential information and temporal characteristics.

4.5. Ablation Study of TCLRec

To further investigate the impact of each component on the performance of TCLRec, we conducted ablation experiments, as shown in Table 3. We define the model without the semantic enhancement module as TCLRec-1, the model without the temporal-aware network as TCLRec-2, and the model retaining only the temporal dependency modeling module in the temporal-aware network as TCLRec-3, all compared against the baseline model ICLRec.
The experimental results demonstrate that all variants of TCLRec outperform the baseline model ICLRec, indicating that TCLRec enhances temporal modeling capabilities for user sequences and exhibits superior potential in capturing latent user intents, thereby validating the rationality and effectiveness of its components. Furthermore, TCLRec achieves better performance than all other variant models across all evaluation metrics. The removal of the semantic enhancement module (TCLRec-1) leads to a consistent decline in performance across all evaluation metrics, indicating that the integration of the S S and S I semantic augmentation operators allows the model to effectively capture comprehensive user semantic features, thereby enhancing the encoder’s capability for intent extraction and strengthening the performance of intent-driven self-supervised learning. The removal of the temporal-aware network (TCLRec-2) results in a significant performance decline, demonstrating that the frequency-domain filtering algorithm effectively eliminates noise interference, while the temporal information encoding helps the model identify relationships between different intent representations, enabling the model to focus on meaningful temporal patterns. The removal of the temporal dependency modeling module (TCLRec-3) also leads to a notable performance degradation, which can be attributed to the fact that the auxiliary encoder-based model enhancement constructs complementary views for sequence SSL, thereby mitigating the limitations of a single-encoder perspective and improving the model’s capacity to capture complex user behavior patterns.

4.6. Effectiveness of Data Augmentation

In this section, to further validate the effectiveness of the semantic enhancement module in TCLRec, we replaced it with three representative data augmentation strategies and compared the resulting variants with TCLRec. TCLRec-RPT [37] incorporates three augmentation strategies—Rotation, Permutation, and Time Warping—to enhance robustness against temporal shifts and localized behavioral patterns. TCLRec-Flip [38] applies a geometric flipping operation that inverts the sign of sequence elements to generate augmented samples, while TCLRec-FDA [39] leverages Formulated Data Augmentation (FDA), which generates augmented samples through cropping, masking, reordering, and pooling. The performance of these three models was systematically evaluated on the Beauty and Sports datasets, with comprehensive results presented in Table 4.
The experimental results demonstrate that TCLRec significantly outperforms existing data augmentation-based sequential recommendation models. The relatively poor performance of TCLRec-RPT and TCLRec-Flip is primarily due to their reliance on geometric perturbations, which overlook the semantic dependencies among items within user sequences, thereby hindering effective modeling of user behavior preferences. TCLRec-FDA generates augmented samples that may deviate from authentic user behavior patterns, particularly because the pooling operation introduces item embeddings that weaken semantic coherence across samples, reducing the stability of model training.
In contrast, the semantic enhancement operators proposed in TCLRec are preference-aware and effectively preserve the core semantic characteristics of behavioral sequences. This enables the model to capture richer interaction signals from multiple augmented views, achieving a better balance between representing user preferences and accurately expressing semantics. Furthermore, the experimental results clearly indicate that adopting differentiated augmentation strategies for long and short sequences in TCLRec is necessary. Specifically, the collaborative effect of the S S , S I , and C operators effectively preserves the essential semantics of short sequences while mitigating semantic loss caused by excessive perturbations.

4.7. Augmentation Analysis

This study compares five augmentation operators for contrastive SSL tasks. To investigate the impact of different data augmentation approaches on model performance, we systematically evaluate three ablation configurations on the Beauty and Sports datasets. The first set employs the leave-one-out method, where each experiment removes one augmentation operator from the set { S S , S I , M, R, C}. The second set adopts the pairwise method, applying only two augmentation operators per experiment to create augmented views for contrastive SSL tasks. Finally, we examine different data augmentation approaches specifically for short sequences to identify their optimal configuration.

4.7.1. Leave-One-Out Comparison

The experiments validate the performance variations under different data augmentation approaches in the leave-one-out setting across both datasets. TCLRec employs { S S , S I , M, R, C} for long sequences and { S S , S I , C} for short sequences, while the baseline model ICLRec uses random augmentation {M, R, C} for both sequence types. For each remaining experimental group, one operator is removed from the full set { S S , S I , M, R, C}, for instance, ‘w/o S S ’ indicates removing the S S semantic augmentation operator, resulting in { S I , M, R, C} for long sequences and { S I , C} for short sequences. The experimental results are shown in Figure 6. It can be observed that TCLRec exhibits significant performance degradation on both datasets when either the S S or S I semantic augmentation operators are removed. This demonstrates that preference-based semantic augmentation can effectively preserve core semantic features of sequences and optimize the balance between original information representation and semantic accuracy, outperforming existing random augmentation operators (M and R).

4.7.2. Pairwise Comparison

The experimental results demonstrate the influence of various data augmentation methods on model performance in the pairwise setting, with the results shown in Figure 7. It can be observed that for identical operator pairs (diagonal values), S S and S I demonstrate optimal performance on the Beauty and Sports datasets, respectively. This indicates that incorporating semantic augmentation methods considering item correlations in contrastive learning tasks can generate higher-quality positive sample sequences, thereby more effectively enhancing model recommendation performance. The random augmentation operators R and M demonstrate inferior performance on both datasets, indicating that masking and reordering operations may disrupt the authentic temporal logic in user behavior sequences, consequently reducing the confidence of positive sample pairs and impairing the model’s recommendation performance. Furthermore, while S S shows suboptimal performance on Sports alone, its combinations with other operators exhibit significant performance advantages, demonstrating that S S can effectively provide complementary views to enhance view diversity in contrastive SSL. Compared with the experimental results in Section 4.7.1, it can be observed that employing multiple augmentation operators yields better performance than using single or dual operators. This is because the utilization of multiple augmentation operators enables the model to capture more comprehensive interaction information from diverse perspectives, thereby learning more generalizable feature representations.

4.7.3. Augmentation Set for Short Sequences

This study accounts for the differing sensitivities of long and short sequences to data augmentation operations. For long sequences, we employ all five augmentation operators, while for short sequences, various augmentation combinations are constructed for comparative experiments to analyze their impact on model performance. The baseline model ICLRec uses only the random augmentation set {M, R, C}, and other experimental groups incorporate semantic augmentation operators S S and S I . The performance of the model on different datasets is illustrated in Figure 8.
The results demonstrate that most combinations incorporating semantic augmentation operators outperform ICLRec, validating the effectiveness and necessity of introducing semantic enhancement mechanisms for short sequence modeling. The performance degrades when C in { S S , S I , C} is replaced with M or R, primarily because M and R fail to consider item correlations and may introduce noise that adversely affects model training. The performance superiority of the { S S , S I , C} combination over { S S , S I , M, R, C} establishes that distinct augmentation operator combinations are required for long versus short sequences. The synergistic effects of the S S , S I , and C operators can effectively preserve core semantic features in short sequences while avoiding information loss caused by excessive perturbations.

4.8. Parametric Analysis

To assess the impact of key hyperparameter settings on model performance, this section presents experimental analysis on the semantic replacement ratio, semantic insertion ratio, long-short sequence threshold, number of intent categories, and intensity of intent contrastive learning.

4.8.1. Performance Impact of Semantic Augmentation Ratios

To evaluate the impact of different semantic replacement ratios α and semantic insertion ratios β on model performance, we conducted experiments with α and β values ranging over {0.1, 0.2, 0.3, 0.4, 0.5}. The experimental results are shown in Figure 9. The optimal value of α is consistently 0.1 across both datasets, indicating that preserving more original sequence information is crucial for model learning. Higher replacement ratios may excessively modify the semantic structure of the data, potentially creating false positive samples that could undermine the sequence SSL task.
The optimal values for parameter β are 0.5 and 0.4 on the Beauty and Sports datasets, respectively. This difference arises from the distinct length skewness between the two datasets; longer sequences have higher probabilities of containing substantial time intervals between interactions, thus requiring more intensive data augmentation. Insufficient data processing would lead to suboptimal performance. It is noteworthy that on the Beauty dataset, when α = 0.5 with β = 0.2 and α = 0.5 with β = 0.5 , the model performance remains robust despite high insertion rates, demonstrating superior results. This indicates that the semantic insertion operation effectively accounts for contextual item correlations, and even at high insertion rates, can enhance sequence semantic representation to improve model performance.
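As an illustration of how the ratios α and β act on a sequence, the following sketch replaces ceil(alpha*n) items and inserts ceil(beta*n) items. The `similar` lookup is hypothetical, standing in for the paper's rating-aware semantic operators; this is an assumption about the operators' mechanics, not their exact definition.

```python
import math
import random

def semantic_replace(seq, alpha, similar, rng=random):
    """S_S-style sketch: replace ceil(alpha*|seq|) randomly chosen items
    with semantically related ones. `similar(item)` is a hypothetical
    lookup (e.g. a preference-aware nearest neighbour)."""
    seq = list(seq)
    for i in rng.sample(range(len(seq)), math.ceil(alpha * len(seq))):
        seq[i] = similar(seq[i])
    return seq

def semantic_insert(seq, beta, similar, rng=random):
    """S_I-style sketch: insert ceil(beta*|seq|) related items, each
    placed right after a randomly chosen position."""
    seq = list(seq)
    for _ in range(math.ceil(beta * len(seq))):
        i = rng.randrange(len(seq))
        seq.insert(i + 1, similar(seq[i]))
    return seq
```

With alpha = 0.1 the replacement touches only one item in a typical short sequence, which matches the observation above that preserving most of the original sequence is important.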

4.8.2. Performance Impact of Long-Short Sequence Threshold

The experiments validate the performance variations when different values are assigned to the long-short sequence threshold L across both datasets, with results shown in Figure 10. The value range for L was set to {0, 2, 4, 6, 8, 10, 12, 14}. It can be observed that the optimal values for parameter L are 10 and 6 on the Beauty and Sports datasets, respectively. By setting different L values for the two datasets, the model’s adaptability to dataset characteristics can be significantly improved. When L is set too small, the model exhibits suboptimal performance because applying the { S S , S I , C} augmentation combination to longer sequences leads to insufficient model training, making it difficult to fully capture deep patterns in sequences. Conversely, when L is too large, the performance degrades, as the { S S , S I , M, R, C} combination applied to shorter sequences makes them more vulnerable to random perturbations, where excessive augmentation may distort or even erase semantic information. The experimental results demonstrate that larger L values better handle complex sequential patterns, while smaller L values are more suitable for short sequences, enabling the model to focus faster on core information while avoiding excessive attention to long-tail noise.
In our model, the long-short sequence threshold L is a fixed hyperparameter used to distinguish short sequences from long ones, guiding the model to apply a suitable augmentation strategy to each length regime. The fixed nature of L offers several benefits. First, it ensures stability and simplicity in training: with a predetermined threshold, the model's backbone can focus on learning meaningful sequence representations and user intent without the additional uncertainty of dynamically learned decisions. Second, it enhances interpretability, as the classification of sequences into "short" and "long" is explicitly defined, allowing us to clearly explain how sequences of different lengths are treated.
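The fixed-threshold dispatch described above can be sketched in a few lines; whether the boundary case len = L counts as short is an assumption made here:

```python
def augmentation_set(seq_len, L):
    """Pick the operator set by the fixed long-short threshold L:
    short sequences (len <= L, a boundary assumption) get the gentle
    {S_S, S_I, C} set, long ones the full {S_S, S_I, M, R, C} set.
    Operator names follow the paper; the dispatch is illustrative."""
    if seq_len <= L:
        return ["S_S", "S_I", "C"]
    return ["S_S", "S_I", "M", "R", "C"]
```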

4.8.3. Performance Impact of Intent Contrastive Learning Parameter Settings

The experiments comprehensively evaluate the impact of intent contrastive learning tasks on model performance through two key aspects: the number of intent categories K and the intensity of intent contrastive learning λ , with results presented in Figure 11. As shown in Figure 11a, TCLRec achieves optimal performance when K increases to 512. This indicates that when K is small, the contrastive SSL process may generate more false positive samples, where users with different intentions are incorrectly grouped into the same intent category, preventing the model from effectively learning genuine user intents. When K is too large, users who should belong to the same intent category are mistakenly divided into different categories, which may degrade the quality of positive sample pairs in contrastive learning and exacerbate convergence difficulties during model training.
Figure 11b indicates that the peak performance occurs when λ = 0.5 . If λ is too small, the intent signals become excessively weak, which may prevent the model from sufficiently capturing the latent intents behind user behaviors and effectively guiding representation learning. Conversely, an excessively large λ may cause the intent signals to dominate the training process, thereby interfering with the learning of the main sequential recommendation task, which is detrimental to the improvement of the model’s recommendation performance. This demonstrates that in multi-task learning frameworks, introducing intent contrastive learning as an auxiliary task incorporates appropriate contrastive signals to improve the model’s capacity for user intent understanding, leading to measurable enhancements in recommendation performance.

4.9. Robustness Analysis

To address the prevalent data sparsity issue in recommendation systems, we design experiments to simulate data environments with varying sparsity levels for verifying model robustness. While keeping the test data unchanged, we train the model with different proportions of training data (25%, 50%, 75%, and 100%), and compare the proposed method with the strongest baseline model, ICLRec, on both Beauty and Sports datasets, as shown in Figure 12. TCLRec exhibits significantly smaller performance fluctuations compared to the baseline model, demonstrating superior robustness. This is attributed to the rating-guided data augmentation process of TCLRec, which generates semantically coherent augmented sequences that better align with user preferences. This approach effectively enriches supervisory signals in sparse scenarios, thereby mitigating the sample scarcity issue during model training. Simultaneously, through frequency-domain filtering, TCLRec constructs higher-confidence positive views for contrastive SSL objectives, endowing the encoder with enhanced robustness during training.
To further validate model robustness, we evaluated the performance variation of TCLRec under different noise intensities in the test data. The experiment trains the model on the original training data while randomly injecting negative user–item interactions at varying proportions (10%, 20%, 30%, 40%, and 50%) into each test sequence. Figure 13 displays the experimental results. The introduction of noisy data degrades the performance of both ICLRec and TCLRec. However, TCLRec consistently exhibits a lower performance degradation rate than ICLRec, and its performance at a 20% noise ratio still surpasses that of ICLRec without noisy data. This demonstrates that the learnable filter can effectively attenuate noise by suppressing high-frequency noise components while enhancing low-frequency semantic features, thereby extracting meaningful representations across all frequency bands and improving the model’s adaptability to noisy data.
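The noise-injection protocol can be sketched as follows. This is a simplified sketch: items absent from the user's sequence stand in for "negative interactions", and both the insertion position and the injected item are chosen uniformly at random, which is an assumption about the protocol's details.

```python
import random

def inject_noise(seq, all_items, ratio, rng=random):
    """Corrupt a test sequence by inserting ratio*|seq| random negative
    items (items the user never interacted with) at random positions,
    mimicking the robustness protocol described above."""
    seq = list(seq)
    in_seq = set(seq)
    negatives = [i for i in all_items if i not in in_seq]
    for _ in range(int(ratio * len(seq))):
        pos = rng.randrange(len(seq) + 1)
        seq.insert(pos, rng.choice(negatives))
    return seq
```

All original interactions are preserved, so any performance drop under this corruption isolates the model's sensitivity to the injected noise.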

5. Conclusions

This paper proposes a sequence recommendation model based on temporal awareness and intent contrastive learning. Starting from global preferences across sequences, the model employs a temporal-aware network to capture both long- and short-term dependencies in user sequences. By adopting a contrastive learning mechanism that implicitly computes time intervals, it facilitates dual encoders to obtain self-supervised signals from multiple perspectives, thereby enhancing the extraction capability for latent intent variables. Meanwhile, the augmentation process is guided by both user rating preferences and item relevance. The threshold for distinguishing between long and short sequences is tuned to each dataset's distribution and sequence-length characteristics, enabling the model to better focus on users' authentic semantic preferences under different behavioral patterns. The experimental results on three public datasets demonstrate that the proposed TCLRec outperforms other sequential recommendation models, with ablation studies further confirming the positive contribution of each component module to recommendation performance.
Although the proposed method achieves significant improvements in recommendation performance, there remains room for further enhancement. For instance, our current work assumes user behavior sequences are driven by a single intent, whereas real-world scenarios often involve multiple concurrent intents. Future research could explore multi-intent modeling mechanisms to better capture these complex intent patterns, thereby further improving model performance. Furthermore, we aim to apply TCLRec in practical scenarios to rigorously evaluate its adaptability and effectiveness in real-world industrial recommendation systems.

Author Contributions

All of the authors contributed to the study conception and design. Conceptualization, Y.F. and Y.Z.; methodology, Y.F.; software, T.S.; validation, Y.F., T.S. and A.W.; formal analysis, Y.Z.; investigation, Y.F.; resources, Y.Z.; data curation, T.S.; writing—original draft preparation, Y.F.; writing—review and editing, Y.F.; visualization, Y.F.; supervision, A.W.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) (62401013), the Major Natural Science Research Project of Anhui Provincial Universities (2024AH040039), and the Anhui Provincial University Excellent Talent Support Program Key Projects (gxyqZD2021124).

Data Availability Statement

The data that support the findings of this study are open access and can be found in the respective databases at https://github.com/FFF0-i/Amazon-Beauty-Dataset, https://github.com/FFF0-i/LastFM-Dataset, https://github.com/FFF0-i/Amazon-Sports-Dataset (accessed on 21 January 2025).

Conflicts of Interest

The authors declare no competing financial or non-financial interests relevant to the content of this article.

References

  1. Mao, C.; Wu, Z.; Liu, Y.; Shi, Z. Matrix factorization recommendation algorithm based on attention interaction. Symmetry 2024, 16, 267. [Google Scholar] [CrossRef]
  2. Xie, F.; Wang, M.; Peng, J.; Shen, D. Differential Weighting and Flexible Residual GCN-Based Contrastive Learning for Recommendation. Symmetry 2025, 17, 1320. [Google Scholar] [CrossRef]
  3. Su, J.; Chen, C.; Lin, Z.; Li, X.; Liu, W.; Zheng, X. Personalized Behavior-Aware Transformer for Multi-Behavior Sequential Recommendation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 6321–6331.
  4. Xia, J.; Li, D.; Gu, H.; Lu, T.; Zhang, P.; Shang, L.; Gu, N. Oracle-guided Dynamic User Preference Modeling for Sequential Recommendation. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, Hannover, Germany, 10–14 March 2025; pp. 363–372.
  5. He, R.; McAuley, J. Fusing similarity models with Markov chains for sparse sequential recommendation. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 191–200.
  6. Hidasi, B.; Tikk, D. General factorization framework for context-aware recommendations. Data Min. Knowl. Discov. 2016, 30, 342–371.
  7. Chen, T.; Yin, H.; Nguyen, Q.V.H.; Peng, W.C.; Li, X.; Zhou, X. Sequence-aware factorization machines for temporal predictive analytics. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; pp. 1405–1416.
  8. Wu, C.Y.; Ahmed, A.; Beutel, A.; Smola, A.J.; Jing, H. Recurrent recommender networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 495–503.
  9. Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; Tikk, D. Session-based recommendations with recurrent neural networks. arXiv 2015, arXiv:1511.06939.
  10. Damak, K.; Khenissi, S.; Nasraoui, O. Debiasing the cloze task in sequential recommendation with bidirectional transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 273–282.
  11. Du, H.; Shi, H.; Zhao, P.; Wang, D.; Sheng, V.S.; Liu, Y.; Liu, G.; Zhao, L. Contrastive learning with bidirectional transformers for sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 396–405.
  12. Zivic, P.; Vazquez, H.; Sánchez, J. Scaling Sequential Recommendation Models with Transformers. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 1567–1577.
  13. Sun, F.; Liu, J.; Wu, J.; Pei, C.; Lin, X.; Ou, W.; Jiang, P. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1441–1450.
  14. Jin, B.; Gao, C.; He, X.; Jin, D.; Li, Y. Multi-behavior recommendation with graph convolutional networks. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; pp. 659–668.
  15. Xia, L.; Huang, C.; Xu, Y.; Dai, P.; Zhang, X.; Yang, H.; Pei, J.; Bo, L. Knowledge-enhanced hierarchical graph transformer network for multi-behavior recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 4486–4493.
  16. Zhou, K.; Yu, H.; Zhao, W.X.; Wen, J.R. Filter-enhanced MLP is all you need for sequential recommendation. In Proceedings of the ACM Web Conference, Lyon, France, 25–29 April 2022; pp. 2388–2399.
  17. Zhou, K.; Wang, H.; Zhao, W.X.; Zhu, Y.; Wang, S.; Zhang, F.; Wang, Z.; Wen, J.R. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 19–23 October 2020; pp. 1893–1902.
  18. Xie, X.; Sun, F.; Liu, Z.; Wu, S.; Gao, J.; Zhang, J.; Ding, B.; Cui, B. Contrastive learning for sequential recommendation. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 1259–1273.
  19. Liu, Z.; Chen, Y.; Li, J.; Yu, P.S.; McAuley, J.; Xiong, C. Contrastive self-supervised sequential recommendation with robust augmentation. arXiv 2021, arXiv:2108.06479.
  20. Qiu, R.; Huang, Z.; Yin, H.; Wang, Z. Contrastive learning for representation degeneration problem in sequential recommendation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event, 21–25 February 2022; pp. 813–823.
  21. Wang, L.; Lim, E.P.; Liu, Z.; Zhao, T. Explanation guided contrastive learning for sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 2017–2027.
  22. Wang, X.; Yue, H.; Wang, Z.; Xu, L.; Zhang, J. Unbiased and Robust: External Attention-enhanced Graph Contrastive Learning for Cross-domain Sequential Recommendation. In Proceedings of the 2023 IEEE International Conference on Data Mining Workshops (ICDMW), Shanghai, China, 1–4 December 2023; pp. 1526–1534.
  23. Huang, C.; Wang, S.; Wang, X.; Yao, L. Dual contrastive transformer for hierarchical preference modeling in sequential recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 99–109.
  24. Qin, X.; Yuan, H.; Zhao, P.; Fang, J.; Zhuang, F.; Liu, G.; Liu, Y.; Sheng, V. Meta-optimized contrastive learning for sequential recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 89–98.
  25. Ma, J.; Zhou, C.; Yang, H.; Cui, P.; Wang, X.; Zhu, W. Disentangled self-supervision in sequential recommenders. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 483–491.
  26. Tan, Q.; Zhang, J.; Yao, J.; Liu, N.; Zhou, J.; Yang, H.; Hu, X. Sparse-interest network for sequential recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual, 8–12 March 2021; pp. 598–606.
  27. Chen, Y.; Liu, Z.; Li, J.; McAuley, J.; Xiong, C. Intent contrastive learning for sequential recommendation. In Proceedings of the ACM Web Conference, Lyon, France, 25–29 April 2022; pp. 2172–2182.
  28. Li, X.; Sun, A.; Zhao, M.; Yu, J.; Zhu, K.; Jin, D.; Yu, M.; Yu, R. Multi-intention oriented contrastive learning for sequential recommendation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 411–419.
  29. Qin, X.; Yuan, H.; Zhao, P.; Liu, G.; Zhuang, F.; Sheng, V.S. Intent contrastive learning with cross subsequences for sequential recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida, Mexico, 4–8 March 2024; pp. 548–556.
  30. Liu, Y.; Zhu, S.; Xia, J.; Ma, Y.; Ma, J.; Liu, X.; Yu, S.; Zhang, K.; Zhong, W. End-to-end learnable clustering for intent learning in recommendation. Adv. Neural Inf. Process. Syst. 2024, 37, 5913–5949.
  31. Zhou, P.; Huang, Y.L.; Xie, Y.; Gao, J.; Wang, S.; Kim, J.B.; Kim, S. Is contrastive learning necessary? A study of data augmentation vs. contrastive learning in sequential recommendation. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 3854–3863.
  32. Wang, C.; Ma, W.; Chen, C.; Zhang, M.; Liu, Y.; Ma, S. Sequential recommendation with multiple contrast signals. ACM Trans. Inf. Syst. 2023, 41, 1–27.
  33. Zhang, P.; Yan, Y.; Zhang, X.; Li, C.; Wang, S.; Huang, F.; Kim, S. TransGNN: Harnessing the collaborative power of transformers and graph neural networks for recommender systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 1285–1295.
  34. Liu, Z.; Chen, Y.; Li, J.; Luo, M.; Yu, P.S.; Xiong, C. Improving contrastive learning with model augmentation. arXiv 2022, arXiv:2203.15508.
  35. McAuley, J.; Targett, C.; Shi, Q.; Van Den Hengel, A. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 43–52.
  36. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv 2012, arXiv:1205.2618.
  37. Um, T.T.; Pfister, F.M.; Pichler, D.; Endo, S.; Lang, M.; Hirche, S.; Fietzek, U.; Kulić, D. Data augmentation of wearable sensor data for Parkinson's disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 216–220.
  38. Wen, Q.; Sun, L.; Yang, F.; Song, X.; Gao, J.; Wang, X.; Xu, H. Time series data augmentation for deep learning: A survey. arXiv 2020, arXiv:2002.12478.
  39. Chen, J.; Zou, G.; Zhou, P.; Wu, Y.; Chen, Z.; Su, H.; Wang, H.; Gong, Z. Sparse enhanced network: An adversarial generation method for robust augmentation in sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 8283–8291.
Figure 1. Architecture of the TCLRec model.
Figure 2. Illustration of the semantic substitution process.
Figure 3. Illustration of the semantic insertion process.
Figure 4. Architecture of the temporal-aware network.
Figure 5. Frequency-domain noise and filtering effects in user behavior sequences.
Figure 6. Performance impact of different augmentation sets on both datasets.
Figure 7. Performance impact of different augmentation pairs on both datasets.
Figure 8. Performance impact of different data augmentation methods for short sequences on both datasets.
Figure 9. Performance impact of semantic augmentation ratios on both datasets.
Figure 10. Performance impact of long-short sequence threshold on both datasets.
Figure 11. Performance impact of intent contrastive learning parameter settings on both datasets.
Figure 12. Performance impact of different training data proportions on both datasets.
Figure 13. Performance impact of different noise ratios on both datasets.
Table 1. Statistical information on the datasets.
Dataset | Users  | Items  | Interactions | Avg. Length | Sparsity (%)
Beauty  | 22,363 | 12,101 | 198,502      | 8.9         | 99.73
Sports  | 35,598 | 18,357 | 296,337      | 8.3         | 99.95
LastFM  | 1,090  | 3,646  | 52,551       | 48.2        | 98.68
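The Avg. Length and Sparsity columns of Table 1 follow directly from the raw counts. A minimal sketch of the arithmetic (the function name is ours; the Sports and LastFM rows reproduce the table exactly, while the printed Beauty sparsity appears to use a slightly different convention):

```python
def dataset_stats(users: int, items: int, interactions: int) -> tuple:
    """Average sequence length and user-item matrix sparsity (in %)."""
    avg_length = interactions / users                      # events per user
    density = interactions / (users * items)               # filled cells
    return round(avg_length, 1), round((1 - density) * 100, 2)

# Sports row of Table 1: 35,598 users, 18,357 items, 296,337 interactions.
avg_len, sparsity = dataset_stats(35_598, 18_357, 296_337)
# avg_len == 8.3, sparsity == 99.95, matching the table.
```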
Table 2. Comprehensive performance of all models on three datasets, with second-best results underlined and best results in bold.
Dataset | Metric  | BPR-MF | GRU4Rec | SASRec | BERT4Rec | CL4SRec | DSSRec | ICLRec | IOCRec | ELCRec | TCLRec
Beauty  | HR@5    | 0.0178 | 0.0180  | 0.0377 | 0.0360   | 0.0401  | 0.0408 | 0.0428 | 0.0427 | 0.0478 | 0.0514
        | HR@10   | 0.0296 | 0.0284  | 0.0624 | 0.0601   | 0.0642  | 0.0616 | 0.0668 | 0.0676 | 0.0712 | 0.0734
        | HR@20   | 0.0474 | 0.0478  | 0.0894 | 0.0984   | 0.0974  | 0.0894 | 0.0993 | 0.1005 | 0.0994 | 0.1019
        | NDCG@5  | 0.0109 | 0.0116  | 0.0241 | 0.0216   | 0.0268  | 0.0263 | 0.0269 | 0.0276 | 0.0308 | 0.0359
        | NDCG@10 | 0.0147 | 0.0150  | 0.0342 | 0.0300   | 0.0345  | 0.0329 | 0.0352 | 0.0357 | 0.0387 | 0.0430
        | NDCG@20 | 0.0192 | 0.0186  | 0.0386 | 0.0391   | 0.0428  | 0.0399 | 0.0434 | 0.0440 | 0.0462 | 0.0502
Sports  | HR@5    | 0.0123 | 0.0162  | 0.0214 | 0.0217   | 0.0231  | 0.0209 | 0.0264 | 0.0258 | 0.0265 | 0.0306
        | HR@10   | 0.0215 | 0.0258  | 0.0333 | 0.0359   | 0.0369  | 0.0328 | 0.0419 | 0.0412 | 0.0410 | 0.0459
        | HR@20   | 0.0369 | 0.0421  | 0.0500 | 0.0604   | 0.0557  | 0.0499 | 0.0637 | 0.0624 | 0.0634 | 0.0674
        | NDCG@5  | 0.0076 | 0.0103  | 0.0144 | 0.0143   | 0.0146  | 0.0139 | 0.0176 | 0.0169 | 0.0177 | 0.0209
        | NDCG@10 | 0.0105 | 0.0142  | 0.0177 | 0.0190   | 0.0191  | 0.0178 | 0.0226 | 0.0219 | 0.0224 | 0.0257
        | NDCG@20 | 0.0144 | 0.0186  | 0.0218 | 0.0251   | 0.0238  | 0.0221 | 0.0281 | 0.0272 | 0.0280 | 0.0311
LastFM  | HR@5    | 0.0191 | 0.0239  | 0.0346 | 0.0376   | 0.0312  | 0.0379 | 0.0303 | 0.0441 | 0.0266 | 0.0460
        | HR@10   | 0.0365 | 0.0358  | 0.0543 | 0.0581   | 0.0143  | 0.0588 | 0.0468 | 0.0608 | 0.0339 | 0.0649
        | HR@20   | 0.0541 | 0.0495  | 0.0771 | 0.0862   | 0.0798  | 0.0869 | 0.0688 | 0.0964 | 0.0514 | 0.1016
        | NDCG@5  | 0.0144 | 0.0155  | 0.0250 | 0.0263   | 0.0213  | 0.0269 | 0.0193 | 0.0318 | 0.0185 | 0.0344
        | NDCG@10 | 0.0188 | 0.0194  | 0.0315 | 0.0332   | 0.0253  | 0.0338 | 0.0246 | 0.0362 | 0.0208 | 0.0409
        | NDCG@20 | 0.0246 | 0.0228  | 0.0374 | 0.0400   | 0.0343  | 0.0401 | 0.0300 | 0.0484 | 0.0252 | 0.0498
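The HR@K and NDCG@K columns in Tables 2–4 follow the standard definitions for evaluation with a single held-out target item per user. A minimal sketch of how such numbers are computed (function names and the example ranks are ours):

```python
import math

def hr_at_k(rank, k):
    """Hit: the held-out item's 0-based rank falls inside the top-k list."""
    return 1.0 if rank is not None and rank < k else 0.0

def ndcg_at_k(rank, k):
    """With one relevant item, NDCG@k reduces to 1/log2(rank + 2) on a hit."""
    return 1.0 / math.log2(rank + 2) if rank is not None and rank < k else 0.0

def evaluate(ranks, k):
    """Average both metrics over per-user ranks (None = item not retrieved)."""
    n = len(ranks)
    return (sum(hr_at_k(r, k) for r in ranks) / n,
            sum(ndcg_at_k(r, k) for r in ranks) / n)

# Hypothetical ranks for four users: hits at positions 0 and 3, two misses.
hr5, ndcg5 = evaluate([0, 3, None, 12], k=5)
# hr5 == 0.5; ndcg5 == (1 + 1/log2(5)) / 4
```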
Table 3. Ablation experimental results of TCLRec.
Dataset | Metric  | ICLRec | TCLRec-1 | TCLRec-2 | TCLRec-3 | TCLRec
Beauty  | HR@5    | 0.0428 | 0.0495   | 0.0479   | 0.0487   | 0.0514
        | HR@10   | 0.0668 | 0.0723   | 0.0717   | 0.0719   | 0.0734
        | HR@20   | 0.0993 | 0.1008   | 0.0922   | 0.1001   | 0.1019
        | NDCG@5  | 0.0269 | 0.0336   | 0.0322   | 0.0326   | 0.0359
        | NDCG@10 | 0.0352 | 0.0407   | 0.0398   | 0.0392   | 0.0430
        | NDCG@20 | 0.0434 | 0.0483   | 0.0475   | 0.0491   | 0.0502
Sports  | HR@5    | 0.0264 | 0.0270   | 0.0269   | 0.0279   | 0.0306
        | HR@10   | 0.0419 | 0.0430   | 0.0428   | 0.0435   | 0.0459
        | HR@20   | 0.0637 | 0.0644   | 0.0635   | 0.0656   | 0.0674
        | NDCG@5  | 0.0176 | 0.0181   | 0.0182   | 0.0187   | 0.0209
        | NDCG@10 | 0.0226 | 0.0232   | 0.0233   | 0.0240   | 0.0257
        | NDCG@20 | 0.0281 | 0.0286   | 0.0285   | 0.0294   | 0.0311
LastFM  | HR@5    | 0.0303 | 0.0420   | 0.0403   | 0.0411   | 0.0460
        | HR@10   | 0.0468 | 0.0584   | 0.0550   | 0.0575   | 0.0649
        | HR@20   | 0.0688 | 0.0805   | 0.0806   | 0.0805   | 0.1016
        | NDCG@5  | 0.0193 | 0.0274   | 0.0250   | 0.0275   | 0.0344
        | NDCG@10 | 0.0246 | 0.0336   | 0.0276   | 0.0322   | 0.0409
        | NDCG@20 | 0.0300 | 0.0458   | 0.0410   | 0.0432   | 0.0498
Table 4. Ablation study of the semantic enhancement module on the Beauty and Sports datasets.
Dataset | Metric  | TCLRec-RPT | TCLRec-Flip | TCLRec-FDA | TCLRec
Beauty  | HR@5    | 0.0310     | 0.0430      | 0.0444     | 0.0514
        | HR@10   | 0.0473     | 0.0645      | 0.0652     | 0.0734
        | HR@20   | 0.0691     | 0.0927      | 0.0908     | 0.1019
        | NDCG@5  | 0.0205     | 0.0283      | 0.0299     | 0.0359
        | NDCG@10 | 0.0258     | 0.0352      | 0.0266     | 0.0430
        | NDCG@20 | 0.0312     | 0.0423      | 0.0431     | 0.0502
Sports  | HR@5    | 0.0128     | 0.0196      | 0.0210     | 0.0306
        | HR@10   | 0.0197     | 0.0314      | 0.0314     | 0.0459
        | HR@20   | 0.0329     | 0.0469      | 0.0467     | 0.0674
        | NDCG@5  | 0.0081     | 0.0132      | 0.0136     | 0.0209
        | NDCG@10 | 0.0103     | 0.0170      | 0.0314     | 0.0257
        | NDCG@20 | 0.0137     | 0.0209      | 0.0208     | 0.0311
Share and Cite

Zhang, Y.; Fan, Y.; Sheng, T.; Wang, A. Temporal-Aware and Intent Contrastive Learning for Sequential Recommendation. Symmetry 2025, 17, 1634. https://doi.org/10.3390/sym17101634
