1. Introduction
As a central component of data-driven marketing decisions, personalized recommender systems are crucial for optimizing the user experience. Amidst widespread digitalization, user behavior data is becoming increasingly complex and dynamic. Sequential recommendation is a key subfield of personalized recommender systems that aims to accurately predict future user behavior by modeling the temporal dependencies in user interactions [1]. However, a fundamental challenge in this field is how to precisely capture the evolution of user interests from massive sequential interaction data. This challenge directly limits the capacity of marketing analytics systems for efficient prediction and real-time intervention.
From a systems thinking perspective, a user’s interaction sequence is not merely a linear stream of events but a complex adaptive system. Within this system, a user’s decision-making process is often influenced by factors across multiple time scales. This requires a sequential recommender model to adopt a holistic view, enabling it to simultaneously comprehend a user’s immediate contextual needs, recent interests, and long-term stable preferences. The complex interplay of these multiple time scales already lies beyond the scope of traditional methods.
Early Markov Chain [2] models laid a foundation for sequential modeling, but their limited ability to capture long-range dependencies makes them insufficient for complex user behavior sequences. In response, deep learning-based models for sequential recommendation have emerged and undergone rapid development. Among these, Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), were first introduced to capture temporal dependencies [3]. More recently, Transformer-based models, leveraging the self-attention mechanism, have achieved significant success in sequential recommendation due to their powerful parallel processing and long-range dependency modeling capabilities. Concurrently, Convolutional Neural Networks (CNNs) offer another effective approach for sequence modeling, with unique advantages in extracting local features and hierarchical information [4]. Specifically, Temporal Convolutional Networks (TCNs) [5] have shown considerable potential by using causal and dilated convolutions to capture long-term dependencies in time-series data [6].
However, with the growing prominence of Transformer [7] and MLP [8] architectures, the focus on CNNs within the sequential recommendation field has gradually diminished. TCNs, as a convolutional architecture suited for sequential data, can theoretically capture long-term dependencies effectively. This capability stems primarily from their use of dilated convolutions, which allow the receptive field to expand exponentially with network depth [9]. Yet, existing TCN-based models face several challenges in practical applications [10]. First, many TCN models do not fully leverage their potential to expand the receptive field within recommendation scenarios. Traditional fixed-size convolutional kernels and static dilation strategies struggle to flexibly capture dependencies across the diverse time spans found in different user behavior sequences. Second, existing TCN architectures tend to focus on sequence orderliness while failing to explicitly model key temporal attributes, particularly the specific time intervals between interactions. For business intelligence systems, such temporal information is critical for understanding customer behavior and making timely decisions [11]. User behavior is not only influenced by item order but is also intricately linked to the precise timestamps of interactions, the duration of intervals between actions, and the dynamic evolution of user interests across multiple time scales. Overlooking this fine-grained temporal information can lead marketing analytics systems to misinterpret user intent. For instance, a high density of interactions in a short period may indicate strong current interest, whereas an interaction after a long interval could signify a shift in interest or the emergence of a new need, often presenting an opportune moment for marketing interventions.
To address the aforementioned limitations and account for dependencies across different time scales, we propose TimeWeaver, a dual-stream temporal convolutional model designed to differentiate between time scales. The model is composed of two parallel processing streams: a context stream and a dynamic stream. On the one hand, some user interests are persistent and can span multiple short-term interaction clusters, which requires the model to have the capacity to capture these long-term trends. To achieve this, the context stream incorporates a modern TCN architecture, leveraging its large receptive field to effectively extract stable feature representations. On the other hand, user behavior sequences also contain short-term interests that reflect immediate needs, necessitating a capacity for responsive real-time prediction. For this purpose, we introduce an Exponential Moving Average (EMA) method into the dynamic stream to enhance the model’s sensitivity to short-term fluctuations, thereby capturing rapidly changing information from recent user interactions more accurately.
The remainder of this paper is organized as follows. Section 2 reviews the related literature on sequential recommender systems and time-aware modeling. Section 3 provides a detailed description of our proposed TimeWeaver model, including its key components: the time-aware augmentation mechanism and the dual-stream architecture. Section 4 presents our extensive experimental evaluation, including comparisons with state-of-the-art baselines, ablation studies, and hyperparameter analysis. Finally, we conclude the paper and outline potential future work in Section 5.
The main contributions of this paper are as follows:
We propose a novel convolutional mechanism and a dual-stream network architecture to effectively model the diverse evolution of user behaviors across long- and short-term timescales. This architecture enables the model to retain long-term historical patterns while precisely capturing recent changes in user interests.
To address the issue of insufficient temporal sensitivity, we introduce a positional encoding calibration method based on temporal features. This approach adjusts the global temporal offset of positional encodings via a learnable scaling factor, which embeds the temporal distribution properties of the data into the sequence representation. This significantly enhances the model’s sensitivity to temporal dynamics.
We have developed a prototype tool, which is available at https://github.com/AraVio/TimeWeaver (accessed on 27 September 2025). Extensive experiments on three public datasets demonstrate that TimeWeaver outperforms existing state-of-the-art methods across various evaluation metrics. Furthermore, through ablation studies and visualization analyses, we verify the effectiveness of our proposed dual-stream architecture, modern convolutional structure, and time-aware augmentation mechanism.
2. Related Work
2.1. Sequential Recommender System
A Sequential Recommender System aims to predict a user’s future interests by capturing the sequential patterns in their historical interactions. Early approaches primarily relied on Markov chains [2,12,13], which model first-order or higher-order transition probabilities to capture dependencies. However, these methods have significant limitations in modeling sparse data and long-term dependencies. The rise of deep learning marked a fundamental paradigm shift in sequential recommender systems, as neural networks can learn high-level, abstract representations of dynamic user interests from complex interaction sequences.
RNNs and their variants, such as GRUs and LSTMs, were among the first architectures widely adopted in this field due to their natural alignment with sequential data. A landmark contribution is GRU4Rec [3], which first applied GRUs to model user behavior sequences and effectively captured the evolution of user interests. Subsequently, researchers proposed numerous enhancements to the RNN architecture, such as incorporating attention mechanisms to weigh the importance of different historical items. Around the same time, the Transformer model, based on self-attention, revolutionized the field of sequence modeling. The pioneering work SASRec [14] employed a unidirectional self-attention mechanism to efficiently capture long-range dependencies in parallel. Following this, BERT4Rec [15] drew inspiration from masked language modeling in natural language processing and uses a bidirectional Transformer to learn deep contextual relationships between items. The success of these models has established self-attention as a core technology in sequential recommender systems.
Furthermore, researchers have explored using the strong local feature extraction capabilities of CNNs to model sequential patterns. Caser [4] stands as a pioneering work in this area. It innovatively represents a user’s recent interaction sequence as an “image” and applies both horizontal and vertical convolutional kernels to capture sequential patterns and latent item associations, respectively. To overcome the receptive field limitations of traditional convolutions for long sequence modeling, NextItNet [6] drew inspiration from the TCN architecture. It introduces stacked causal dilated convolutions, which exponentially expand the model’s receptive field without increasing computational cost, thereby enabling the capture of both long- and short-term dependencies. More recent research has trended towards fusing CNNs with other architectures to leverage their complementary strengths. For example, DACNN [16] constructs dual user-item interaction sequences and models them with an attention mechanism to capture richer dynamic information. In another example, AdaMCT [17] designs parallel CNN and Transformer branches, which use an adaptive gating mechanism to dynamically fuse local patterns from the CNN with global dependencies captured by the Transformer, striking a balance between performance and efficiency.
More recently, the landscape of sequential recommendation has been significantly influenced by the advent of Large Language Models (LLMs). This emerging trend generally involves two main approaches: using LLMs to generate auxiliary features for traditional models, or employing LLMs directly as the recommendation engine [18]. In the latter approach, user interaction histories are often formatted as natural language prompts to predict the next item. While LLMs offer powerful reasoning capabilities, they face challenges in efficiently processing long and complex user sequences, often requiring sophisticated retrieval-augmented techniques to select the most relevant historical interactions [19].
2.2. Time-Aware Sequential Recommender Systems
Temporal information is a critical factor in Sequential Recommender Systems, as temporal context significantly influences user interests and behavior patterns. Early sequential recommender models focused primarily on item order while ignoring the specific timestamps of interactions. This simplification limited the models’ ability to comprehend the dynamic nature of user preferences. With advances in the field, researchers recognized the importance of the temporal dimension and began exploring effective ways to integrate it into model architectures.
Initially, research on time-aware sequential recommender systems centered on modeling time intervals. This involved introducing the time gap between adjacent user interactions as an additional feature, which was processed through simple linear transformations or embedding layers. However, this direct feature-concatenation approach often fails to capture the deeper, more complex interactions between temporal information and user interests. Following the application of the Transformer architecture in sequential recommender systems, researchers began to incorporate time into the self-attention mechanism. A representative work, TiSASRec [20], integrates time intervals into the attention weight calculation by constructing and encoding a time-difference matrix. While this approach improved the model’s temporal awareness, it also substantially increased its complexity and computational cost.
Subsequent research has further diversified the approaches for time-aware modeling. For instance, MEANTIME [21] employs a multi-head attention mechanism to capture varied temporal patterns within behavior sequences, while TimelyRec [22] adopts a hierarchical strategy to model user behavior at different time granularities. TAT4SRec [23] utilizes an encoder-decoder architecture that separately models timestamps and interacted items, integrating this information during the decoding stage to generate time-aware recommendations. Other works have focused on creating dynamic item representations; for example, DIDN [24] introduces a dynamic intent-aware module to construct evolving item embeddings by incorporating temporal order. Another novel direction explores the distribution of time intervals themselves, where [25] proposes data augmentation techniques to transform sequences with irregular time gaps into more uniform ones, thereby improving model performance. Although these methods have advanced the models’ capabilities for temporal awareness, most are still constrained by complex attention mechanism designs and high computational costs.
In summary, while these studies have achieved significant progress in sequential recommendation, several key limitations persist. First, Transformer-based models, despite their power, are computationally expensive. Their self-attention mechanism applies a uniform approach to all historical items, making it difficult to distinguish between long-term stable preferences and short-term interest drifts. This challenge is further exacerbated in recent Large Language Model-based approaches. Despite their advanced reasoning capabilities, these models are often limited by the efficiency and scalability required to process long user histories. Second, most CNN-based architectures, including modern TCNs, lack a dedicated mechanism for explicitly modeling dependencies across different time scales. They typically apply a single convolutional structure to the entire sequence, which is insufficient for capturing the complex interplay between short-term and long-term user interests. Furthermore, existing time-aware methods often integrate temporal information through complex modifications to the attention mechanism or simple feature concatenation. These approaches can either increase model complexity or fail to fully capture the nuanced influence of time intervals on user behavior.
Our proposed TimeWeaver model is designed to directly address these shortcomings. It introduces a dual-stream architecture to overcome the limitations of single-paradigm models. Moreover, our novel Time-Aware Augmentation mechanism performs a dynamic calibration of positional encodings. By learning from the global temporal distribution properties within the data, this mechanism enriches the sequence representation with nuanced temporal dynamics without fundamentally altering the downstream network architecture.
3. Method
3.1. Problem Statement
In a Sequential Recommender System, we define a set of users $\mathcal{U}$ and a set of all items $\mathcal{V}$.
The historical interactions for each user $u \in \mathcal{U}$ are recorded as a chronologically ordered sequence. This sequence consists of a sequence of items $S^u = (v_1^u, v_2^u, \dots, v_{n_u}^u)$ and a corresponding sequence of timestamps $T^u = (t_1^u, t_2^u, \dots, t_{n_u}^u)$.
Here, $v_i^u \in \mathcal{V}$ is the $i$-th item in user $u$’s interaction history, and $t_i^u$ is the precise time of that interaction, satisfying $t_1^u \le t_2^u \le \dots \le t_{n_u}^u$. The length of user $u$’s interaction history is denoted by $n_u$.
The objective of time-aware sequential recommendation is to predict the item a user is most likely to interact with at the next timestep, given the user’s historical item sequence $S^u$ and timestamp sequence $T^u$. This task can be formally formulated as learning a conditional probability distribution:
$$P\left(v_{n_u+1}^u = v \mid S^u, T^u\right), \tag{1}$$
where $v \in \mathcal{V}$ is any candidate item from the entire item set.
3.2. Model Overview
To address the limitations of existing methods in modeling multi-scale temporal dependencies, we propose the TimeWeaver model, which is illustrated in Figure 1a. The model’s overall workflow comprises three main stages: Time-Aware Augmentation, Dual-Stream Encoding, and Final Prediction.
First, the model processes the input item sequence and timestamp sequence in an embedding and augmentation stage. This stage fuses the item embeddings with temporally-adjusted positional embeddings produced by our proposed Time-Aware Augmentation mechanism. This mechanism adjusts positional information by leveraging the time intervals between interactions, which generates an initial sequence representation $\mathbf{H}^{(0)}$ that is sensitive to temporal dynamics. This representation then serves as the input to the subsequent encoding modules.
Next, $\mathbf{H}^{(0)}$ is fed into a dual-stream encoder, which consists of a stack of $L$ identical modules. In each encoding layer $l$, the hidden state $\mathbf{H}^{(l-1)}$ from the previous layer is processed in parallel by a context stream and a dynamic stream. The context stream employs a modified TCN structure to better capture the user’s long-term preferences, while the dynamic stream leverages an EMA mechanism to sensitively capture rapidly changing short-term interests. The outputs from these two streams are then fused by a Stream Weaver and combined with a residual connection to produce the updated layer representation $\mathbf{H}^{(l)}$.
After passing through all encoding layers, the model uses the last vector from the final sequence representation $\mathbf{H}^{(L)}$ to compute an inner product with the item embedding matrix. This procedure yields the final predicted probability distribution for the next item.
3.3. Time-Aware Augmentation
Given a user’s item sequence $S^u$ and its corresponding timestamp sequence $T^u$, the model first converts each item $v_i$ into a $d$-dimensional vector representation $\mathbf{e}_i$ using an item embedding matrix $\mathbf{M} \in \mathbb{R}^{|\mathcal{V}| \times d}$. To incorporate sequential information, we also employ a standard positional embedding matrix $\mathbf{P} \in \mathbb{R}^{N \times d}$ to generate an encoding $\mathbf{p}_i$ for each position $i$. Here, $N$ is the maximum sequence length supported by the model, and sequences exceeding this length are truncated.
The core of this mechanism is the dynamic calibration of positional encodings using temporal information. First, we discretize each raw timestamp $t_i$ into an integer index $\tau_i$ within the range $[0, T_{\max} - 1]$ by applying a modulo operation:
$$\tau_i = t_i \bmod T_{\max}, \tag{2}$$
where $T_{\max}$ is the upper bound for the time index. Subsequently, a time embedding layer maps these discrete time indices $\tau_i$ to vector representations, yielding $\mathbf{z}_i \in \mathbb{R}^{d}$.
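We then introduce a global position offset strategy, which computes the mean of the time embeddings at each position $i$ across all sequences within a batch. For a batch $\mathcal{B}$, let $\mathbf{z}_i^{(b)}$ denote the time embedding at position $i$ of the $b$-th sequence. The batch-level average time embedding for position $i$ is then:
$$\bar{\mathbf{z}}_i = \frac{1}{|\mathcal{B}|} \sum_{b=1}^{|\mathcal{B}|} \mathbf{z}_i^{(b)}. \tag{3}$$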
This “global position offset” strategy is central to our time-aware augmentation. The resulting vector $\bar{\mathbf{z}}_i$ does not represent the temporal information of a single sequence; rather, it captures a shared temporal pattern across all sequences in the batch at a specific position $i$. For instance, if user interactions at the beginning of sessions typically exhibit long time intervals, the average time embedding will encode this pattern. Conversely, if interactions toward the end of sequences are usually rapid, the corresponding average vector will reflect this higher frequency of interaction. We then scale this mean vector by a learnable scalar parameter $\gamma$ to adjust the original positional encoding $\mathbf{p}_i$:
$$\tilde{\mathbf{p}}_i = \mathbf{p}_i + \gamma \, \bar{\mathbf{z}}_i. \tag{4}$$
The parameter $\gamma$ is learned adaptively during training to dynamically regulate the influence of temporal information on the positional encodings.
Finally, the temporally-augmented sequence representation is formed by fusing the item embeddings with the calibrated positional encodings. By adding this shared temporal pattern back to the original positional encoding, we are effectively calibrating it. The standard positional encoding $\mathbf{p}_i$ only conveys order. Our calibrated encoding $\tilde{\mathbf{p}}_i$, however, conveys richer, time-aware information: “this is the $i$-th item, and it is typically associated with this specific temporal pattern.” This allows the model to better distinguish between positions based not only on their order but also on the behavioral patterns associated with that order. To stabilize the training, we apply Layer Normalization and Dropout:
$$\mathbf{H}^{(0)} = \mathrm{Dropout}\left(\mathrm{LayerNorm}\left(\mathbf{E} + \tilde{\mathbf{P}}\right)\right), \tag{5}$$
where $\mathbf{E}$ and $\tilde{\mathbf{P}}$ stack the item embeddings and the calibrated positional encodings over all positions. Here, $\mathbf{H}^{(0)}$ is the resulting initial sequence representation, which serves as the input to the dual-stream encoder. In addition to preserving order, a key aspect of this mechanism is the effective integration of the data’s global temporal distribution properties into the sequence representation.
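The calibration steps described in this subsection can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the function names, the toy embedding tables, the scalar name `gamma`, and the assumption that all sequences in a batch have equal length (no padding) are ours, and Layer Normalization and Dropout are omitted for brevity.

```python
from typing import List

Vector = List[float]


def discretize(timestamps: List[int], t_max: int) -> List[int]:
    """Map raw timestamps to integer indices in [0, t_max - 1] via modulo."""
    return [t % t_max for t in timestamps]


def batch_mean_time_embedding(batch_indices: List[List[int]],
                              time_table: List[Vector]) -> List[Vector]:
    """Average the time embeddings at each position over the whole batch."""
    batch_size = len(batch_indices)
    seq_len = len(batch_indices[0])  # assumption: equal-length sequences
    dim = len(time_table[0])
    means = []
    for i in range(seq_len):
        acc = [0.0] * dim
        for seq in batch_indices:
            acc = [a + e for a, e in zip(acc, time_table[seq[i]])]
        means.append([a / batch_size for a in acc])
    return means


def calibrate_positions(pos_table: List[Vector],
                        mean_time: List[Vector],
                        gamma: float) -> List[Vector]:
    """Shift each positional encoding by the scaled batch-mean time embedding."""
    return [[p + gamma * m for p, m in zip(p_row, m_row)]
            for p_row, m_row in zip(pos_table, mean_time)]


# Toy example: 2 sequences of length 2, embedding dimension 2, t_max = 4.
time_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
pos_table = [[1.0, 1.0], [2.0, 2.0]]
batch = [discretize([10, 13], 4), discretize([7, 21], 4)]
mean_time = batch_mean_time_embedding(batch, time_table)
calibrated = calibrate_positions(pos_table, mean_time, gamma=0.5)
```

Because the offset is averaged over the batch, every sequence in the batch receives the same calibrated positional encodings; only the item embeddings differ between sequences.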
3.4. Dual-Stream Encoder
The dual-stream encoder consists of $L$ stacked identical layers, taking the time-aware augmented sequence representation $\mathbf{H}^{(0)}$ as its initial input. Within each encoding layer $l$, the output from the preceding layer $\mathbf{H}^{(l-1)}$ (where $l = 1, \dots, L$) is fed in parallel into two specialized streams: a context stream and a dynamic stream. The outputs of these two streams are then fused and integrated with a residual connection, which produces the layer’s final output $\mathbf{H}^{(l)}$.
3.4.1. Context Stream
The context stream is designed to efficiently capture medium- and long-term contextual information within user behavior sequences. Its core is a modern temporal convolution module, which we term TCNNext. The design of this module is inspired by “modern convolution,” as it adopts the block-like structure of Transformers. It also employs large-kernel convolution to achieve a vast effective receptive field, enhancing its ability to model long-term dependencies, as illustrated in Figure 2. The TCNNext module leverages recent advances in modern convolutional network design and is optimized for sequential recommendation tasks.
For the $l$-th layer in the encoder, the context stream takes the output from the previous layer, $\mathbf{H}^{(l-1)}$, as its input. It then refines dynamic patterns in the sequence through a series of specialized operations. As shown in Figure 1b, the TCNNext module consists of three key sub-layers connected in series. Each sub-layer is followed by a residual connection and layer normalization to ensure stable information flow and training convergence.
The first sub-layer of the module is a depthwise separable convolution. Unlike standard convolution, this operation independently applies a size-adaptive large-kernel depthwise convolution to each feature dimension (see Section 4.4 for details). This approach effectively expands the receptive field at a low computational cost, allowing it to capture longer-range dependencies. This convolution is immediately followed by a batch normalization layer. Specifically, the input $\mathbf{H}^{(l-1)}$ first passes through this convolutional layer. A residual connection to the original input is then added, and the result is passed through layer normalization to yield the intermediate representation $\mathbf{C}_1^{(l)}$:
$$\mathbf{C}_1^{(l)} = \mathrm{LayerNorm}\left(\mathbf{H}^{(l-1)} + \mathrm{DWConv}\left(\mathbf{H}^{(l-1)}\right)\right). \tag{6}$$
Here, $\mathrm{DWConv}(\cdot)$ denotes the combination of large-kernel depthwise separable convolution and batch normalization.
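To make the "one kernel per feature dimension" idea concrete, the following is a minimal pure-Python sketch of the depthwise step only. The function name, the symmetric zero padding, and the toy kernels are assumptions made for illustration; the text does not state the padding scheme, and the pointwise convolution, batch normalization, residual connection, and layer normalization are left out.

```python
from typing import List


def depthwise_conv1d(x: List[List[float]],
                     kernels: List[List[float]]) -> List[List[float]]:
    """Apply one odd-length kernel per channel along the time axis.

    x:       seq_len rows, each with `channels` features.
    kernels: `channels` rows, each an odd-length 1D kernel.
    Zero padding keeps the output length equal to the input length
    (an assumption; a causal variant would pad on the left only).
    """
    seq_len, channels = len(x), len(x[0])
    k = len(kernels[0])
    half = k // 2
    out = [[0.0] * channels for _ in range(seq_len)]
    for c in range(channels):          # channels never mix in this step
        for t in range(seq_len):
            acc = 0.0
            for j in range(k):
                src = t + j - half     # input position under kernel tap j
                if 0 <= src < seq_len:
                    acc += kernels[c][j] * x[src][c]
            out[t][c] = acc
    return out


# Channel 0 uses a summing kernel, channel 1 an identity kernel.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
kernels = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
y = depthwise_conv1d(x, kernels)
```

Each channel needs only `k` weights, so the parameter count grows linearly in the kernel size rather than with the square of the hidden dimension, which is what makes large kernels affordable.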
The second sub-layer is a parallel convolutional interaction module designed for the complex integration of temporal features. As formalized in Equations (7)–(9), the input $\mathbf{C}_1^{(l)}$ (the output of the first sub-layer) is processed by two parallel convolutional branches. Their outputs are subsequently combined using a gating mechanism, which facilitates a dynamic and non-linear fusion of local contextual information captured from different receptive fields. In these equations, $\sigma$ denotes the GELU activation function and $\odot$ represents element-wise multiplication.
The final sub-layer is a standard position-wise feed-forward network (FFN), identical to the FFN architecture used in Transformers. It consists of two linear transformation layers with an intermediate GELU activation function. This FFN applies an independent non-linear transformation to the representation at each time step, thereby enhancing the model’s representational power. A residual connection is added between the input $\mathbf{C}_2^{(l)}$ (the output of the second sub-layer) and the FFN’s output, followed by layer normalization. This yields the final output of the context stream for the current encoding layer, $\mathbf{C}^{(l)}$:
$$\mathbf{C}^{(l)} = \mathrm{LayerNorm}\left(\mathbf{C}_2^{(l)} + \mathrm{FFN}\left(\mathbf{C}_2^{(l)}\right)\right). \tag{10}$$
By stacking these three sub-layers, the context stream comprehensively extracts rich dynamic interest features from the user sequence, spanning from local to medium-term dependencies.
3.4.2. Dynamic Stream
Unlike the context stream, which focuses on modeling long-term dependencies, the dynamic stream is designed to accurately capture immediate fluctuations in user interest. Its core design principle is a high sensitivity to recent user behavior. A user’s immediate needs are often dominated by their most recent interactions; therefore, the model must assign greater weight to these recent signals.
At the $l$-th layer of the encoder, the stream receives the output from the preceding layer, $\mathbf{H}^{(l-1)}$, and first processes it using an EMA. This mechanism aims to suppress noise arising from short-term interactions by recursively smoothing sequence features, thereby reinforcing and preserving persistent signals that span multiple interaction clusters. The traditional recursive formulation of EMA is:
$$\mathbf{s}_t = \alpha \, \mathbf{x}_t + (1 - \alpha) \, \mathbf{s}_{t-1}, \tag{11}$$
where $\mathbf{s}_t$ is the smoothed representation at timestep $t$, $\mathbf{x}_t$ is the input at that timestep, and $\alpha \in (0, 1)$ is the smoothing factor; the recursion starts from $\mathbf{s}_1 = \mathbf{x}_1$. However, this recurrent definition is not amenable to efficient implementation on modern parallel computing architectures.
Therefore, we adopt a non-recurrent, vectorized computation scheme that is mathematically equivalent to the recursive form but enables full parallelization. For an input sequence representation $\mathbf{X} \in \mathbb{R}^{B \times N \times d}$, where $B$ is the batch size, $N$ is the sequence length, and $d$ is the feature dimension, the vector at each timestep in the EMA-smoothed sequence is computed via a normalized, weighted cumulative sum.
We define a vector of decay exponents $\mathbf{p} = (N-1, N-2, \dots, 0)$ and use it to generate two key weight vectors. The first is a normalization weight vector, $\mathbf{w}_{\mathrm{norm}}$, where each element is calculated as $(1-\alpha)^{p_t}$. The second is a primary weight vector, $\mathbf{w}$, which is derived from $\mathbf{w}_{\mathrm{norm}}$. Its first element is identical to the first element of $\mathbf{w}_{\mathrm{norm}}$, while all subsequent elements are their counterparts in $\mathbf{w}_{\mathrm{norm}}$ multiplied by the smoothing factor $\alpha$.
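The smoothed sequence $\tilde{\mathbf{X}}$ is obtained by first computing a weighted cumulative sum $\mathbf{S}$ of the input sequence $\mathbf{X}$ using the weight vector $\mathbf{w}$. The resulting sequence is subsequently normalized element-wise by the vector $\mathbf{w}_{\mathrm{norm}}$, as formalized in Equations (12) and (13):
$$\mathbf{S} = \mathrm{cumsum}\left(\mathbf{X} \odot \mathbf{w}\right), \tag{12}$$
$$\tilde{\mathbf{X}} = \mathbf{S} \oslash \mathbf{w}_{\mathrm{norm}}, \tag{13}$$
where $\mathrm{cumsum}(\cdot)$ represents the prefix sum operation along the temporal dimension, and $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively.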
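The equivalence between the recursive EMA and the prefix-sum form can be checked with a short pure-Python sketch on a scalar sequence. The function names and the parameter name `alpha` are ours, and the sketch assumes the recursion is initialized with the first input, as the construction of the primary weight vector implies.

```python
from itertools import accumulate
from typing import List


def ema_recursive(x: List[float], alpha: float) -> List[float]:
    """Reference EMA: s_t = alpha * x_t + (1 - alpha) * s_{t-1}, s_1 = x_1."""
    out = [x[0]]
    for value in x[1:]:
        out.append(alpha * value + (1.0 - alpha) * out[-1])
    return out


def ema_prefix_sum(x: List[float], alpha: float) -> List[float]:
    """Same EMA computed as a normalized, weighted cumulative sum."""
    n = len(x)
    # Decay exponents run from n - 1 down to 0.
    w_norm = [(1.0 - alpha) ** p for p in range(n - 1, -1, -1)]
    # Primary weights: first entry unchanged, the rest scaled by alpha.
    w = [w_norm[0]] + [alpha * v for v in w_norm[1:]]
    prefix = accumulate(xi * wi for xi, wi in zip(x, w))
    return [s / d for s, d in zip(prefix, w_norm)]


x = [1.0, 2.0, 4.0, 3.0]
recursive = ema_recursive(x, alpha=0.5)
parallel = ema_prefix_sum(x, alpha=0.5)
```

In a tensor framework, the list comprehension and `accumulate` would be replaced by a broadcasted multiplication and a cumulative-sum primitive along the time axis, which removes the sequential loop over timesteps.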
To further enhance the representational power of the dynamic stream, the EMA-smoothed features are passed through a linear projection layer. This projection aligns the features with the space learned by the context stream. Thus, the output of the dynamic stream at layer $l$, denoted as $\mathbf{D}^{(l)}$, is formulated as:
$$\mathbf{D}^{(l)} = \mathrm{EMA}\left(\mathbf{H}^{(l-1)}\right) \mathbf{W}_d^{(l)} + \mathbf{b}_d^{(l)}. \tag{14}$$
Here, $\mathrm{EMA}(\cdot)$ denotes the parallelized exponential moving average operation. The terms $\mathbf{W}_d^{(l)}$ and $\mathbf{b}_d^{(l)}$ are the weight matrix and bias vector, respectively, of the linear layer in the dynamic stream at layer $l$.
3.4.3. Stream Weaver
To synergistically integrate the complementary information extracted by the two parallel streams, we introduce a core fusion module at the end of each encoding layer, which we term the “Stream Weaver”. This module is designed to effectively aggregate feature representations from the two different temporal scales.
Specifically, for the output of layer $l$, the representations from the context stream, $\mathbf{C}^{(l)}$, and the dynamic stream, $\mathbf{D}^{(l)}$, are first concatenated along the feature dimension. This combined representation is then fed into a linear layer. This layer utilizes learnable parameters to adaptively weight the features from each stream and projects the fused information back to the original hidden dimension. Finally, we incorporate a residual connection from the previous layer’s input, $\mathbf{H}^{(l-1)}$, followed by layer normalization. This entire process is formulated as:
$$\mathbf{H}^{(l)} = \mathrm{LayerNorm}\left(\mathbf{H}^{(l-1)} + \left[\mathbf{C}^{(l)} \,\Vert\, \mathbf{D}^{(l)}\right] \mathbf{W}_f^{(l)} + \mathbf{b}_f^{(l)}\right), \tag{15}$$
where $[\cdot \,\Vert\, \cdot]$ denotes the concatenation operation along the feature dimension. The terms $\mathbf{W}_f^{(l)}$ and $\mathbf{b}_f^{(l)}$ are the weight matrix and bias vector, respectively, for the fusion linear layer at layer $l$.
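The fusion step can be sketched per timestep in pure Python as follows. This is an illustration under assumptions rather than the released code: the helper names are ours, the weight matrix is stored as output rows over the concatenated input, and the layer normalization omits its learnable gain and bias.

```python
import math
from typing import List

Vector = List[float]


def layer_norm(v: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def linear(v: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Affine map; `weight` has one row per output feature."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weight, bias)]


def stream_weaver(h_prev: List[Vector], context: List[Vector],
                  dynamic: List[Vector], weight: List[Vector],
                  bias: Vector) -> List[Vector]:
    """Concatenate both streams, project to d features, add residual, normalize."""
    out = []
    for h, c, d in zip(h_prev, context, dynamic):
        fused = linear(c + d, weight, bias)  # `c + d` concatenates the lists
        out.append(layer_norm([a + f for a, f in zip(h, fused)]))
    return out


# One timestep, hidden size 2; the toy weights average the two streams.
weight = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
bias = [0.0, 0.0]
h_next = stream_weaver([[1.0, 3.0]], [[2.0, 0.0]], [[0.0, 4.0]], weight, bias)
```

With learned weights, the projection can emphasize either stream per feature, so the balance between long-term context and recent dynamics is fitted to the data rather than fixed in advance.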
3.5. Prediction Layer
After passing through the $L$ layers of the dual-stream encoder, the model yields the final hidden state representation of the sequence, $\mathbf{H}^{(L)}$. To predict the user’s next behavior, we take the vector corresponding to the last time step, denoted as $\mathbf{h}_{\mathrm{last}}$, which aggregates the user’s dynamic long- and short-term interests. The preference score $\hat{y}_v$ for each candidate item $v$ from the entire item set $\mathcal{V}$ is then calculated as the inner product of this final user representation and the item’s embedding vector $\mathbf{e}_v$. This prediction process is formalized as:
$$\hat{y}_v = \mathbf{h}_{\mathrm{last}}^{\top} \mathbf{e}_v. \tag{16}$$
To optimize the model parameters, we employ the cross-entropy (CE) loss function as the training objective, following the standard paradigm for sequential recommendation. The task is framed as a multi-class classification problem: given a user’s historical sequence, the model must correctly classify the next item of interaction from the entire item set $\mathcal{V}$. For any given training instance, the loss is calculated as:
$$\mathcal{L} = -\log \frac{\exp\left(\hat{y}_{v^{+}}\right)}{\sum_{v \in \mathcal{V}} \exp\left(\hat{y}_v\right)}. \tag{17}$$
Here, $v^{+}$ is the ground-truth item that the user interacts with at the next time step. The denominator normalizes the prediction scores over all candidate items via the Softmax function. Minimizing this loss function trains the model to learn sequence representations that accurately predict a user’s future interests.
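A minimal sketch of the scoring and loss computation for a single training instance is given below. The function names and the toy embeddings are illustrative assumptions; the log-sum-exp shift is a standard numerical-stability device and does not change the value of the loss.

```python
import math
from typing import List


def preference_scores(h_last: List[float],
                      item_embeddings: List[List[float]]) -> List[float]:
    """Inner product between the last hidden vector and every item embedding."""
    return [sum(h * e for h, e in zip(h_last, emb)) for emb in item_embeddings]


def cross_entropy(scores: List[float], target: int) -> float:
    """Negative log-softmax of the target item's score over all items."""
    shift = max(scores)  # subtract the maximum before exponentiating
    log_z = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_z - scores[target]


# Toy example: hidden size 2, three candidate items, item 2 is the target.
item_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = preference_scores([1.0, 2.0], item_embeddings)
loss = cross_entropy(scores, target=2)
```

Because the same embedding table is used for the input items and for scoring, no separate output layer is needed; the softmax runs over the entire item set rather than over sampled negatives.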
4. Experiments
4.1. Training Configuration
All experiments were conducted on a hardware platform equipped with a 12-core Intel(R) Xeon(R) Silver 4214R CPU @ 2.40 GHz and an NVIDIA RTX 3080 Ti GPU with 12 GB of VRAM. The software environment consisted of Ubuntu 20.04, and all models were implemented in Python 3.8 using the PyTorch 1.11.0 framework, with GPU acceleration provided by CUDA 11.6.
4.2. Datasets
Our experimental evaluation is conducted on three widely used Amazon review datasets: Beauty, Sports, and Toys (the datasets are publicly available at http://jmcauley.ucsd.edu/data/amazon/, accessed on 27 September 2025). These datasets are selected because they are derived from real-world consumer scenarios and exhibit significant differences in item categories, user behavior patterns, and data density. This diversity allows for a robust evaluation of our model’s generalizability across various user preferences and item distributions.
To adapt the data for the sequential recommendation task, we apply a consistent preprocessing pipeline. First, we convert all user behaviors, such as ratings and reviews, into binary implicit feedback (i.e., an “interaction”). Second, we generate a chronological sequence of interactions for each user based on the timestamps. Finally, to mitigate noise from data sparsity, we filter the data by retaining only users and items with at least five interactions. The statistics of the resulting datasets, which are used for model training and testing, are summarized in Table 1.
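The preprocessing steps above can be sketched as follows. The function names and the `(user, item, timestamp)` tuple layout are assumptions, and the filter is applied repeatedly until no user or item falls below the threshold; the text does not say whether the filtering is iterative or a single pass.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Interaction = Tuple[str, str, int]  # (user, item, timestamp)


def k_core_filter(interactions: List[Interaction], k: int = 5) -> List[Interaction]:
    """Keep only users and items with at least k interactions (iterated)."""
    current = list(interactions)
    while True:
        user_counts = Counter(u for u, _, _ in current)
        item_counts = Counter(i for _, i, _ in current)
        kept = [(u, i, t) for u, i, t in current
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(current):  # nothing removed: filtering is stable
            return kept
        current = kept


def build_sequences(interactions: List[Interaction]) -> Dict[str, List[str]]:
    """Group interactions by user and order each user's items by timestamp."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    return {user: [item for _, item in sorted(events)]
            for user, events in by_user.items()}


# Toy data with k = 2 so the effect is visible on a handful of rows.
raw = [("u1", "a", 3), ("u1", "b", 1), ("u2", "a", 2),
       ("u2", "b", 5), ("u3", "c", 1)]
sequences = build_sequences(k_core_filter(raw, k=2))
```

Ratings and review text are already discarded in this representation, which corresponds to treating every recorded event as one implicit interaction.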
4.3. Evaluation Metrics
For evaluation, we adopt the standard leave-one-out strategy common in sequential recommendation. Specifically, for each user’s interaction sequence, the final item is used as the ground truth for the test set, the second-to-last item is used for validation, and all preceding items are used for training. To ensure a rigorous evaluation, we rank the target item against the entire item corpus, rather than employing negative sampling techniques that might introduce bias.
We evaluate model performance using two standard metrics: Hit Rate (HR@k) and Normalized Discounted Cumulative Gain (NDCG@k). HR@k measures the fraction of times the ground-truth item appears in the top-k recommended list, which is equivalent to Recall@k under this setting. NDCG@k is a position-aware metric that evaluates ranking quality by assigning greater importance to items ranked higher. We report the performance for k values of 5, 10, and 20. For both metrics, higher values indicate better recommendation performance.
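The evaluation protocol can be sketched in a few lines of Python. The function names are ours, and tie handling (ties are resolved in favor of the target) is an assumption; with a single ground-truth item per user, the ideal DCG is 1, so NDCG@k reduces to 1/log2(rank + 1) when the target is ranked within the top k.

```python
import math
from typing import List, Tuple


def leave_one_out(sequence: List[int]) -> Tuple[List[int], int, int]:
    """Split one user's sequence into (training items, validation target, test target)."""
    if len(sequence) < 3:
        raise ValueError("need at least three interactions")
    return sequence[:-2], sequence[-2], sequence[-1]


def rank_of_target(scores: List[float], target: int) -> int:
    """1-based rank of the target among all items (full ranking, no sampling)."""
    return 1 + sum(1 for s in scores if s > scores[target])


def hit_rate_at_k(ranks: List[int], k: int) -> float:
    """Fraction of users whose target item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks: List[int], k: int) -> float:
    """Mean of 1 / log2(rank + 1) over users, counting only ranks within the top k."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)


train, valid_target, test_target = leave_one_out([1, 2, 3, 4, 5])
ranks = [1, 3, 12]  # ranks of the test target for three hypothetical users
hr5, ndcg5 = hit_rate_at_k(ranks, 5), ndcg_at_k(ranks, 5)
```

In this toy case two of the three targets fall within the top 5, and the second one contributes only half as much to NDCG@5 as the first because it sits at rank 3.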
4.4. Baselines & Implementation Details
To facilitate a comprehensive evaluation, we compare our proposed model against a set of representative baseline methods that span from classic to state-of-the-art architectures:
CNN-based Methods: We include Caser [4], which utilizes convolutional operations to extract high-order Markov patterns from user sequences.
RNN-based Methods: We select GRU4Rec [3], a pioneering model in this domain that uses GRUs to capture temporal dependencies in user behavior sequences.
Transformer-based Methods: This category represents the current mainstream in sequential recommendation. We select three prominent models: SASRec [14] employs a unidirectional self-attention mechanism to model user sequences; BERT4Rec [15] uses a bidirectional self-attention mechanism and learns deeper representations via a masked language modeling (MLM) task; TiSASRec [20] extends SASRec by incorporating time interval information into the self-attention computation to more accurately model dynamic user interests.
Emerging Architectures: We also include two recent models. FMLP-Rec [8] is a pure MLP-based architecture that uses a filter module to suppress noise. LRURec [26] is built upon Linear Recurrent Units, aiming to combine the inference efficiency of RNN-like models with the parallel training capabilities of Transformer-like models.
All experiments are conducted within a unified framework. For all baselines, we tune hyperparameters based on a combination of the recommendations from their original papers and a grid search [27,28]. We use the Adam optimizer with a learning rate of 0.001 and a batch size of 256. The embedding dimension is set to 64 and the dropout rate is set to 0.5 for all models. During data processing, sequences are truncated to a maximum length of 50. All models are trained from scratch without any pre-trained parameters.
Furthermore, for our model, the kernel size of the depthwise separable convolution layer is adaptively configured based on the maximum sequence length of the dataset. This strategy balances the effective receptive field and computational efficiency across different datasets. Specifically, the base kernel size is set to one-third of the maximum sequence length. If this value is even, it is incremented by one to ensure it is odd. To prevent excessive parameter growth and computational cost on sequences with extreme lengths, we cap the maximum kernel size at 15.
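The kernel-size rule above can be sketched as a small function. The name `adaptive_kernel_size` is hypothetical, and rounding "one-third" down with integer division is an assumption, since the text does not state the rounding mode.

```python
def adaptive_kernel_size(max_seq_len, max_kernel=15):
    """Odd kernel size derived from the maximum sequence length, capped at 15."""
    base = max_seq_len // 3    # one-third of the length (floor is an assumption)
    if base % 2 == 0:
        base += 1              # bump even values so the kernel is odd
    return min(base, max_kernel)   # the cap of 15 is itself odd
```

Under these assumptions, a maximum length of 50 gives 16, which becomes 17 and is then capped at 15, while a maximum length of 20 gives 6, which becomes 7.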
4.5. Overall Performance Comparison
The experimental results in Table 2 provide a comprehensive performance comparison between our proposed model and various mainstream baselines. Traditional models like Caser and GRU4Rec consistently lag behind on all datasets. Specifically, Caser uses CNNs to extract local patterns but struggles to capture long-range dependencies in user sequences due to the limited receptive field of its convolutional operations. Similarly, GRU4Rec, which relies on GRUs for temporal modeling, performs reasonably well on short sequences but suffers from information loss when handling longer ones. This observation is consistent with the limitations of traditional sequential models discussed in the introduction.
In contrast, Transformer-based models, including SASRec, BERT4Rec, and TiSASRec, achieve significant performance gains. This demonstrates the advantage of the self-attention mechanism in capturing complex and long-range dependencies among items. Notably, TiSASRec, which incorporates time interval information, shows a distinct performance improvement over SASRec and BERT4Rec. This result underscores the importance of explicitly modeling temporal dynamics, which is a core motivation for our research.
Among the emerging architectures, LRURec demonstrates highly competitive performance, outperforming most mainstream baselines on the majority of metrics. We therefore consider it a primary baseline for comparison. This suggests that its linear recurrent structure, which combines the advantages of RNN-like and Transformer-like models, is effective for the sequential recommendation task. In contrast, while the pure MLP-based FMLP-Rec performs adequately on some metrics, its overall performance does not surpass the top-performing Transformer and LRURec models, indicating its limitations in finely capturing sequential dependencies.
Finally, our proposed TimeWeaver model achieves state-of-the-art performance, outperforming all baseline methods, including LRURec, across all evaluation metrics on all three datasets (Beauty, Sports, and Toys). Specifically, compared to the strongest baseline, LRURec, TimeWeaver delivers average relative improvements of 4.62%, 9.59%, and 4.59% across all metrics on the Beauty, Sports, and Toys datasets, respectively. This consistent and significant outperformance strongly validates the synergy among our proposed dual-stream architecture, the time-aware augmentation mechanism, and the modern TCN module.
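One plausible way to compute an "average relative improvement across all metrics" is to take the per-metric relative gain over the baseline and average those gains. The sketch below follows that reading. The function name is hypothetical, the exact averaging procedure used for the reported figures is not stated in the text, and the numbers in the usage note are placeholders rather than values from Table 2.

```python
def average_relative_improvement(ours, baseline):
    """Mean of per-metric relative gains over the baseline, in percent.

    `ours` and `baseline` map metric names (e.g. "HR@10") to scores.
    """
    gains = [(ours[m] - baseline[m]) / baseline[m] for m in baseline]
    return 100.0 * sum(gains) / len(gains)
```

With placeholder scores of 0.11 versus 0.10 on one metric and 0.06 versus 0.05 on another, the per-metric gains are 10% and 20%, so the function returns about 15.0.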
4.6. Ablation Study
To investigate the individual contribution of each core component of TimeWeaver, we conduct an ablation study on the Beauty, Sports, and Toys datasets. By systematically removing or replacing key components, we create several model variants and measure the resulting performance changes to quantify the effectiveness of our design. All ablation variants are trained with the same hyperparameter configuration as the full model to ensure a fair comparison. We design the following four ablation variants:
w/o Time: This variant removes the time-aware augmentation module and uses only standard static positional encodings, thus not explicitly modeling the time intervals between interactions. It is designed to evaluate the contribution of explicit temporal information to sequence modeling and interest perception.
w/o Dynamic Stream: This variant ablates the dynamic stream, relying solely on the context stream for sequence modeling. It is used to assess the impact of the EMA mechanism on capturing short-term interest fluctuations.
w/o Context Stream: We remove the context stream and retain only the EMA-based dynamic stream to evaluate the contribution of long-term interest modeling to the overall performance.
w/o ConvInter: This variant removes the parallel convolutional interaction module from the TCNNext block, retaining only the main convolution and feed-forward structures. It is used to evaluate the effectiveness of the complex local temporal feature fusion mechanism.
The performance of each ablation variant is presented in Table 3. As a general observation, the removal of any single key component leads to a significant degradation in model performance. A detailed analysis is as follows:
Time-Aware Augmentation: On average, removing this mechanism causes a performance drop of 4.04% in HR@20 and 3.37% in NDCG@20 across all datasets. This indicates that static positional encodings alone are insufficient to capture the temporal dynamics of user behavior, and that explicitly modeling temporal features effectively enhances the model’s perception of interest evolution.
Dual-Stream Architecture: Ablating either the dynamic or the context stream results in a clear performance decline across all metrics. This demonstrates that the two streams are complementary in capturing both long-term stable preferences and short-term immediate needs, making the dual-stream design essential for modeling multi-scale interest dynamics.
Parallel Convolutional Interaction: Removing this module leads to a consistent performance drop across all metrics, which highlights its important role in dynamic feature fusion and the extraction of complex temporal patterns.
These results demonstrate that TimeWeaver’s innovative components are critical for synergistically modeling multi-scale temporal features. They contribute significantly to the model’s superior recommendation performance and validate the effectiveness and necessity of our architectural design.
4.7. Hyperparameter Sensitivity Analysis
In this section, we investigate the impact of TimeWeaver’s key hyperparameters on its performance to validate its robustness and guide hyperparameter selection. Specifically, we analyze the sensitivity to two key parameters: the EMA decay rate, which has a significant impact on the dual-stream architecture, and the time scaling factor.
Figure 3 illustrates the performance changes in NDCG@20 and HR@20 on the Beauty and Sports datasets as the EMA decay rate varies.
The results show that as the EMA decay rate increases, model performance exhibits a downward trend on both datasets. This phenomenon suggests that a smaller decay rate is more effective for modeling the dynamic stream. Although this stream is designed to capture short-term interests, it must retain sufficient historical context to maintain the coherence of the sequence. Conversely, an excessively high value causes the model to overly emphasize the most recent interactions, thereby ignoring the broader session context and impairing recommendation accuracy.
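To make the role of the decay rate concrete, the sketch below applies a generic exponential moving average over a sequence of embedding vectors. It is an illustration only: the update h_t = decay * x_t + (1 - decay) * h_{t-1} is assumed because it matches the behavior described above (a larger rate puts more weight on recent interactions), and the model's actual EMA formulation may differ. The function name `ema_stream` and the list-of-lists input are hypothetical.

```python
def ema_stream(embeddings, decay):
    """Exponential moving average over a sequence of embedding vectors.

    Assumed update: h_t = decay * x_t + (1 - decay) * h_{t-1}, with h_0 = x_0.
    A larger `decay` forgets history faster.
    """
    state = list(embeddings[0])
    outputs = [list(state)]
    for x in embeddings[1:]:
        state = [decay * xi + (1.0 - decay) * si for xi, si in zip(x, state)]
        outputs.append(list(state))
    return outputs
```

With a decay of 0.5, the contribution of the first interaction halves at every step; with a decay close to 1, the state is almost entirely the latest interaction, which is the regime the analysis above associates with degraded accuracy.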
Figure 4 presents the model’s sensitivity to the time scaling factor, which controls the magnitude of the temporal adjustment applied to the positional encodings. The results show that performance peaks when the scaling factor is relatively small. This indicates that a modest scaling factor provides a more effective temporal calibration for the static positional encodings. In contrast, an overly large factor can allow the temporal signal to dominate the original positional information, which negatively affects model performance.
Overall, TimeWeaver demonstrates good robustness to both key hyperparameters, with its performance remaining relatively stable across a reasonable range of their values.
5. Conclusions
In this paper, we proposed TimeWeaver, a time-aware dual-stream network designed to efficiently model temporal dependencies and dynamic changes within user behavior sequences. By leveraging large-kernel convolution, a time-aware augmentation mechanism, and a dual-stream architecture, TimeWeaver effectively captures users’ long-term preferences while remaining highly sensitive to recent changes in their interests. A key contribution of this work is demonstrating that a specialized, dual-stream architecture can effectively resolve the inherent trade-off between capturing long-range dependencies and responding to short-term interest shifts. Experimental results demonstrate that TimeWeaver outperforms existing state-of-the-art models on several public datasets, which validates its superiority in modeling dependencies across multiple time scales. Furthermore, the ablation study and hyperparameter analysis confirmed the effectiveness and significant contribution of each innovative module to the model’s performance.
Despite these promising results, our work has limitations that open avenues for future research. One limitation of the current study is that TimeWeaver, similar to many state-of-the-art sequential recommenders, operates in a transductive setting. Consequently, the model can only recommend items that were present in the training set and cannot inherently handle new items introduced after training, a classic manifestation of the “new item cold-start” problem. Future work could address this challenge by extending TimeWeaver into an inductive framework. A promising direction involves incorporating item side-information, such as textual descriptions or visual attributes. By training a content encoder (e.g., a pre-trained language model) alongside the main model, TimeWeaver could learn to dynamically generate embeddings for new items from their features. This would enable the model to recommend items outside its original training vocabulary, significantly enhancing its practical applicability in dynamic environments like e-commerce, where new products are constantly introduced. In addition to addressing this primary limitation, future work could also involve exploring more complex inter-stream interaction mechanisms or extending this time-aware framework to other recommendation scenarios rich in temporal dynamics.