Non-Stationary Transformer Architecture: A Versatile Framework for Recommendation Systems

Abstract: Recommendation systems are crucial in navigating the vast digital market. However, the dynamic and non-stationary nature of user data often hinders their efficacy. Traditional models struggle to adapt to the evolving preferences and behaviours inherent in user interaction data, posing a significant challenge for accurate prediction and personalisation. Addressing this, we propose a novel theoretical framework, the non-stationary transformer, designed to effectively capture and leverage the temporal dynamics within data. This approach enhances the traditional transformer architecture by introducing mechanisms that account for non-stationary elements, offering a robust and adaptable solution for multi-tasking recommendation systems. Our experimental analysis, encompassing deep learning (DL) and reinforcement learning (RL) paradigms, demonstrates the framework's superiority over benchmark models. The empirical results confirm our proposed framework's efficacy, providing significant performance enhancements: approximately an 8% reduction in LogLoss and up to a 2% increase in F1 score compared with other attention-related models. The results also underscore its potential applicability in accumulative-reward scenarios with pure reinforcement learning models. These findings advocate adopting non-stationary transformer models to tackle the complexities of today's recommendation tasks.


Introduction
In 2023, global e-commerce not only continued its robust growth, reaching a remarkable $6.3 trillion in sales, an increase of 10.4% from the previous year [1,2], but also witnessed the parallel rise of short video content, with the market for short video platforms anticipated to expand to USD 3.24 billion by 2030 [3]. In this rapidly evolving environment, effectively recommending relevant products and engaging video content to consumers has become vital. Consequently, the field of recommendation systems has witnessed accelerated development, with various platforms, from online marketplaces to news aggregation sites and short video platforms, deploying algorithms tailored to their unique requirements. These algorithms are often rooted in advanced methodologies: deep learning-based models for predicting product click-through rates (CTR), such as deep factorisation machines (DeepFM) [4], wide & deep [5], deep interest evolution network (DIEN) [6], and behaviour sequence transformer (BST) [7], and reinforcement-learning-based models for sequential product recommendations, such as Exact-k [8] and traditional deep reinforcement learning algorithms, including double deep Q-network (DDQN) [9], proximal policy optimisation (PPO) [10], and deep deterministic policy gradient (DDPG) [11], which aim to maximise overall utility.
The diversity in data dimensions across products, users, and interactions inherently introduces significant disparities within recommendation systems. Traditionally, these models, such as DIEN [6] and BST [7], standardise data into a normalised, quasi-normal distribution, which inadvertently ignores non-stationary data characteristics while enhancing computational efficiency. This loss of nuanced information can significantly diminish the effectiveness of recommendations, as models trained on stable datasets often perform well only within those controlled environments and lack generalisation capability in real-time, complicated recommendation systems. Therefore, finding a solution to the intricacies of non-stationary data is vital. Scholars across various fields have explored it, particularly in time-series modelling [12,13] and quantitative finance [14]. In these areas, the focus has often been on adapting models to handle the unpredictable nature of data over time, thereby enhancing predictive accuracy and robustness. In the specific context of recommendation systems, while there has been some focus on mitigating the effects of non-stationarity through techniques like data pruning [15,16], less attention has been given to restoring data to its original, untransformed state. This gap highlights an opportunity for innovation: developing methodologies that accommodate and capitalise on the inherent variability within data. By embracing non-stationarity, it may be possible to construct more adaptive, robust recommendation models that reflect the dynamic nature of user preferences and product features.
This paper addresses the challenge of preserving the intrinsic value provided by non-stationary data within recommendation systems. By introducing a novel mapping relationship within the non-stationary transformer architecture, we aim to enhance model robustness without sacrificing the utility of transformed data in different recommendation system tasks. In our experiments on the deep learning dataset Tenrec [17] and the reinforcement learning dataset RL4RS [18], models incorporating our non-stationary transformer architecture consistently excelled in their respective tasks. Specifically, in the context of Tenrec, we observed significant improvements in Area Under Curve (AUC) and LogLoss metrics. Meanwhile, in RL-based tasks with the RL4RS dataset, the average cumulative rewards achieved by our models substantially surpassed those of the original reinforcement learning models. These results underscore the robust applicability and enhanced performance of our proposed framework across different domains of recommendation systems. The structure of this paper is organised into subsequent sections: Section 2 reviews related work on recommendation systems, spotlighting various methodologies currently employed to navigate the complexities of non-stationary data. Section 3, the methodology section, explains the construction of our model, detailing its integration within deep learning-based models and its application in reinforcement learning paradigms. Sections 4 and 5 present an analytical discourse on our experimental findings, where, through a series of diverse datasets, we validate our model's efficacy in predicting click-through rates within deep learning contexts and its proficiency in forecasting cumulative rewards in reinforcement learning scenarios. We demonstrate our model's superior performance on test datasets through comparative analysis, substantially enhancing its generalisation capabilities. Section 6 concludes our current discourse, reflecting on the work's achievements while acknowledging its limitations and proposing directions for future research.

Related Work
We review the previous work related to recommendation systems from three perspectives. Initially, we discuss models based on deep learning, which have been widely adopted for their predictive accuracy and scalability. Following this, we explore reinforcement-learning-based recommendation systems, highlighting how these models adapt and optimise user engagement over time. Finally, we discuss previous works on processing non-stationary data and introduce our non-stationary transformer architecture to address the challenges of non-stationary environments in recommendation systems.

Deep Learning-Based Recommendation System
The evolution of machine learning algorithms towards deep learning has significantly influenced the development of recommendation system models. Traditionally grounded in machine learning and statistical algorithms, such as collaborative filtering [19], alternating least squares (ALS) [20], and factorisation machines (FM) [21], the field has witnessed the emergence of deep learning-based extensions that enhance these foundational models. The factorisation-machine-supported neural network (FNN) [22] represents one of FM's earliest deep learning expansions, laying the groundwork for subsequent innovations. For instance, the Wide & Deep model [5] merges a basic linear regression model with a multi-layer perceptron (MLP) deep learning network, thereby harnessing both memorisation and generalisation capabilities. Building on these advancements, deep factorisation machines (DeepFM) [4] optimises the network structure further, integrating FM and deep neural networks to improve prediction accuracy. Attentional factorisation machines (AFM) [23] introduce an attention mechanism into the network, enabling the model to focus on relevant features dynamically. Additionally, graph factorisation machines (GraphFM) [24,25] incorporate graph neural network modules, enhancing the model's ability to leverage complex relational data. Parallel to these developments, deep learning-based sequence recommendation algorithms have also evolved, primarily leveraging recurrent neural networks (RNNs). The deep interest network (DIN) [26] enhances basic sequence models with an attention mechanism, focusing on capturing evolving user interests. This concept is further extended by the deep interest evolution network (DIEN) [6], which divides the model into layers for user behaviour sequences, interest extraction, and interest evolution, addressing the dynamic nature of user preferences. The introduction of the transformer model led to the development of the behaviour sequence transformer (BST) [7], which employs the transformer encoder architecture to integrate user historical interaction data with user and item features. Furthermore, Bidirectional Encoder Representations from Transformer for Recommendation (BERT4Rec) [27] adapts the Bidirectional Encoder Representations from Transformer (BERT) [28] model's capabilities for recommendation systems, showcasing the adaptability of deep learning innovations in this domain. These models update their structures from RNNs to improve predictive robustness but do not detail how to recover the data to their original status.

Reinforcement-Learning-Based Recommendation System
In recommendation systems, reinforcement learning has been innovatively adapted from classical RL algorithms to address this field's unique challenges and dynamics. These approaches leverage real-time user feedback and interactions to continually refine and optimise recommendation strategies, embodying a proactive approach to user engagement and satisfaction. This adaptation has led to the evolution of specialised models, branching from foundational RL algorithms like deep Q-network (DQN) [29], DDPG, and soft actor-critic (SAC), each contributing distinct methodologies and applications within the recommendation ecosystem [30]. Adapting the DQN [31] framework has given rise to innovative models such as the deep recommender system (DEERS) [32], which models user interactions as a Markov decision process (MDP), dynamically adapting to positive and negative feedback to refine the user experience. The deep reinforcement learning framework for news recommendation (DRN) [33], tailored for personalised online news recommendations, optimises future rewards by accounting for the fluctuating nature of news preferences and user engagement patterns. Integrating social network insights, the social attentive deep Q-network (SADQN) [34] addresses data sparsity and cold-start challenges, offering more precise recommendations by harnessing social influences and individual preferences. The deep reinforcement learning approach for online advertising impressions in recommender systems (DEAR) [35] employs DQN to balance ad revenue generation with user experience in advertising, making strategic ad inclusion and placement decisions. Within the DDPG framework, models like DeepPage [36] utilise deep reinforcement learning to optimise digital page layouts, responding dynamically to user feedback to boost engagement on e-commerce platforms. The knowledge-guided deep reinforcement learning (KGRL) [37] system combines RL with knowledge graphs, employing an actor-critic architecture to improve decision-making in interactive recommendation systems, surpassing traditional methods in performance. The supervised deep reinforcement learning recommendation framework (SRR) [38] addresses the challenge of prioritising top recommendations by blending supervised learning with RL, enhancing the efficacy of top-position recommendations without compromising long-term goals. The SAC [39] algorithm has inspired models such as multi-agent soft signal-actor (MASSA) [40], which adopts a multi-agent cooperative RL approach for optimising module rankings in e-commerce settings without inter-module communication. Employing a signal network to generate coordination signals, this model fosters global policy exploration. Similarly, multi-agent spatio-temporal reinforcement learning (MASTER) [41] leverages a multi-agent spatio-temporal RL framework to recommend electric vehicle charging stations, considering long-term spatial and temporal dynamics and demonstrating enhanced performance in practical applications. These extensions of DQN, DDPG, and SAC target specific recommendation scenarios, focusing on adding task-specific modules to improve prediction based on stationary features.
In recommendation systems' evolving domain, models based on deep learning and reinforcement learning typically process data into a standardised, stationary format, approximating a normal distribution to streamline model analysis. This standardisation, however, can inadvertently strip away unique data characteristics, potentially affecting analytical outcomes. Existing research has tried tailoring recommendation systems to adapt to the inherently non-stationary nature of user preferences and behaviours, introducing solutions to maintain system efficacy under such variability. Ye et al. introduce an adaptive case that employs a novel pruning algorithm for large-scale recommendation systems grappling with non-stationary data distributions, effectively balancing model adaptability and computational efficiency [15]. Huleihel et al. extend collaborative filtering techniques to accommodate the temporal variability in user preferences, enhancing recommendations' relevance and personalisation [42]. Wu et al. propose a two-tiered hierarchical bandit algorithm to navigate the exploration-exploitation trade-off in environments characterised by non-stationarity and delayed feedback, facilitating more timely and contextually appropriate recommendations [43]. Chandak et al. address the challenge of delayed feedback in such settings with a stochastic, non-stationary bandit model that leverages intermediate observations to refine learning processes and decision-making [44]. Despite the notable advancements these studies contribute to the field, their application tends to be constrained by the specific contexts for which they were developed, limiting their generalisability. Our research aims to bridge this gap by reinstating the non-stationary attributes of data within a more universally applicable recommendation system framework. By integrating our model within both deep learning and reinforcement learning paradigms, we improved performance over traditional models that rely on stationary data processing. This enhancement underscores not only the robustness of our approach but also its versatility across a broad spectrum of complex data scenarios.

Methodology
In this section, we introduce the architecture of our non-stationary transformer [12,14], initially presenting its foundational structure. Following this, we integrate it into deep learning and reinforcement-learning-based recommendation systems, demonstrating the model's versatility and wide applicability across various recommendation scenarios.

Projector Layer
The initiation point of our non-stationary transformer is the projector layer, a module designed to detect and adapt to the evolving patterns within sequential datasets. The adaptation process commences with

X_reduced = (1/T) ∑_{t=1}^{T} x_t,

where X_reduced represents the dimensionally reduced data obtained through averaging over the temporal dimension T. Each x_t corresponds to the data vector at time point t, capturing specific features. The index t runs from 1 to T, indicating the sequential nature of the averaged data points. Following this, the data are processed through a series of transformation layers, each comprising a dense neural network structure with Leaky ReLU activation, encapsulated as

h = LeakyReLU(W_hidden X_reduced + b_hidden),

where W_hidden and b_hidden denote the weights and biases of the hidden layers, respectively. Then the final output, incorporating the essence of the non-stationary features, is rendered through

output = tanh(W_out h + b_out),

where tanh represents the hyperbolic tangent function, encapsulating the detected non-stationary aspects.
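As an illustration, the projector computation above can be sketched in a few lines of pure Python; the weight arguments (`w_hidden`, `b_hidden`, `w_out`, `b_out`) are illustrative stand-ins for the learned parameters, not the paper's implementation:

```python
import math

def leaky_relu(x, slope=0.01):
    return x if x >= 0.0 else slope * x

def projector(x_seq, w_hidden, b_hidden, w_out, b_out):
    """Sketch of the projector layer: average over the temporal
    dimension T, one hidden dense layer with Leaky ReLU activation,
    then a tanh output capturing the non-stationary statistics."""
    T = len(x_seq)
    # X_reduced: mean of the feature vectors over the temporal dimension
    x_reduced = [sum(col) / T for col in zip(*x_seq)]
    # hidden dense layer with Leaky ReLU activation
    hidden = [leaky_relu(sum(w * x for w, x in zip(row, x_reduced)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # final tanh output
    return [math.tanh(sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(w_out, b_out)]
```

In practice, each step would be a learned `nn.Linear` layer operating on batched tensors; the sketch only fixes the order of operations.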

Transformer Encoder Layer
The transformer encoder layer is at the core of our structure, adding a self-attention mechanism specifically tailored for analysing complex sequential data. Integral to this encoder are the dynamic elements, namely scale_learner and shift_learner. These elements are crucial for adapting to changes in data over time, with the scale_learner adjusting the significance of different temporal features and the shift_learner accommodating shifts in the data's patterns or distributions. Together, they ensure the model's attention mechanism remains attuned to the evolving characteristics of the sequential data, as expressed by:

log τ = scale_learner(x_raw, σ_enc), ∆ = shift_learner(x_raw, µ_enc),

where σ_enc and µ_enc denote the standard deviation and mean of the input sequences, respectively. The adapted attention mechanism in our dynamic structure is formulated to accommodate the intricacies of non-stationary data. Specifically, Q′, K′, and V′ represent the stabilised versions of the queries, keys, and values obtained from the original dataset. The attention function is represented as:

Attn(Q′, K′, V′) = Softmax((τ ⊙ (Q′K′⊤) + ∆) / √d_k) V′. (5)

In Equation (5), the operation τ ⊙ (Q′K′⊤) effectively scales the dot product of the queries Q′ and keys K′ with the scaling factor τ, which is designed to adjust for time-varying aspects of the data. The term ∆ introduces a shift to these scaled scores, further tailoring the attention scores to the non-stationary characteristics of the dataset. The normalisation factor √d_k, where d_k denotes the dimensionality of the keys, ensures that the scaled dot products maintain a consistent variance, promoting stable gradients throughout the model. The softmax function is then applied to the resulting scores, converting them into a probability distribution. This step ensures that each value lies in the interval (0, 1) and that the entire vector sums to 1.
Finally, the attention scores are applied to the values V′ through a weighted sum. This multiplication aggregates the information across all values, weighted by their relevance as determined by the attention scores, culminating in the output of the attention mechanism. This output serves as a contextually enriched representation that synthesises the most relevant information from the input data, adjusted for both the temporal dynamics and the non-stationary features inherent in the dataset. A layer normalisation and dropout combination is applied to ensure stability and prevent model overfitting. We pass the attention result through dropout and then apply layer normalisation:

output = LayerNorm(Dropout(Attn(Q′, K′, V′))).

With its novel approach to handling non-stationary data through adaptive learning and dynamic adjustments, this architecture can be applied to designing transformers for complex and evolving datasets.
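The de-stationary attention of Equation (5) can be sketched as follows; for simplicity, τ and ∆ are taken as scalars here, whereas in the full model they are learned, per-sequence quantities produced by scale_learner and shift_learner:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def destationary_attention(Q, K, V, tau, delta):
    """Sketch of Softmax((tau * Q'K'^T + delta) / sqrt(d_k)) V'.
    Q, K, V are lists of row vectors; tau rescales and delta shifts
    the dot-product scores before the softmax."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # tau-scaled, delta-shifted dot-product scores, normalised by sqrt(d_k)
        scores = [(tau * sum(qi * ki for qi, ki in zip(q, k)) + delta) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # each weight in (0, 1), summing to 1
        # weighted sum over the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

With τ = 1 and ∆ = 0, this reduces to standard scaled dot-product attention, which makes the non-stationary correction easy to isolate when testing.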

Fusion in the Deep Learning-Based Recommendation System
We then integrate our non-stationary transformer architecture into the deep-learning-based recommendation system framework, mainly focusing on refining the BST [7] algorithm. This integration involves replacing conventional transformer layers with our advanced non-stationary transformer modules to better capture temporal dynamics and distributional shifts in user behaviour sequences; see Figure 1. Embedding Layer: The embedding layer initiates the adaptation process by transforming the multifaceted input data into compact, low-dimensional vector representations. The input data are categorised into three principal segments: (1) the core component comprises the user behaviour sequence, encapsulating the dynamic interplay between users and items over time; (2) auxiliary features encompass a broad spectrum of attributes, including user demographics, product specifications, and contextual information, enriching the model's understanding beyond mere interaction patterns; (3) the target item features primarily focus on the characteristics of new or prospective items that are subjects of prediction. Each of these segments undergoes a distinct embedding process, resulting in specialised embeddings that collectively form a comprehensive representation of the multifaceted input data within our model. This embedding strategy is crucial for capturing the nuanced relationships and attributes inherent in user behaviour sequences, auxiliary features, and target items. To preserve the sequential essence of user interactions, we employ positional features that assign temporal values based on the chronological distance between item interactions and the moment of recommendation.
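The positional feature just described can be sketched as a simple chronological distance, assuming raw integer timestamps; the subsequent bucketing and embedding lookup are omitted:

```python
def positional_features(interaction_times, recommend_time):
    """BST-style positional feature for each item in the behaviour
    sequence: the chronological distance between the interaction and
    the moment of recommendation (larger value = older interaction)."""
    return [recommend_time - t for t in interaction_times]
```

These distances would then typically be discretised into buckets and fed through an embedding table alongside the item and auxiliary feature embeddings.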

Non-stationary Transformer Layer: We replace the BST algorithm's standard transformer layers with our non-stationary transformer layers. This substitution improves the model's ability to adapt to temporal variations and data distribution shifts, thereby enabling a deeper understanding of complex inter-item relationships and user interaction patterns within a dynamically changing context.

Multi-layer Perceptron Layers:
The final part of our architecture is marked by a series of MLP layers coupled with a customised loss function designed for the binary classification task of predicting user clicks or the multi-classification task of predicting product scores. This final ensemble leverages the enriched feature set processed through the non-stationary transformer layers, facilitating precise and context-aware recommendations.
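For the binary click-prediction case, the loss reduces to the standard binary cross-entropy (LogLoss); a minimal sketch over raw logits, assuming a sigmoid output unit:

```python
import math

def bce_logloss(logits, labels):
    """Sketch of the binary cross-entropy (LogLoss) objective used by
    a click-prediction head: sigmoid over the logit, then the negative
    log-likelihood of the observed click label, averaged over examples."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid click probability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)
```

A logit of 0 gives p = 0.5, so a positive label incurs a loss of log 2, the familiar chance-level LogLoss for binary prediction.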
By adding the non-stationary transformer to the structure of the BST algorithm, our approach retains the original model's capability to process user behaviour sequences while significantly enhancing the adaptability and predictive accuracy of user interaction modelling. This novel integration represents a significant improvement in deep-learning-based recommendation systems, promising superior performance in navigating the complexities of dynamic user behaviour patterns.

Fusion in the Reinforcement-Learning-Based Recommendation System
We embedded our non-stationary transformer architecture into the core of reinforcement-learning-based recommendation systems, specifically choosing the DDQN, DDPG, and SAC frameworks for integration. These frameworks are classic models within the field of reinforcement learning and represent different branches of the discipline, which showcases the versatility of our non-stationary transformer. By leveraging this architecture, we aim to enhance the models' predictive capabilities and robustness, especially given their superior handling of non-stationary data characteristics. With their distinct mechanisms and strengths, the choice of DDQN, DDPG, and SAC provides a broad and comprehensive testing ground to demonstrate our approach's enhanced adaptability and performance across various reinforcement learning scenarios.
Integration with DDQN: Integrating the non-stationary transformer within the DDQN framework substantially augments the model's precision in value estimation and policy optimisation (Figure 2). DDQN [9], an extension of DQN [29], introduces a critical improvement by decoupling the selection and evaluation of the action in the Q-value update equation, thereby mitigating overestimation. The standard DQN update equation [45] is given by

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],

where s_t and a_t are the state and action at time t, r_{t+1} is the reward received after taking action a_t, α is the learning rate, and γ is the discount factor. In DDQN, this is modified to

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q_target(s_{t+1}, argmax_a Q(s_{t+1}, a)) − Q(s_t, a_t)],

where Q_target represents the action-value function estimated by the target Q-network. In DDQN, we introduce a dual mechanism that significantly boosts the model's ability to process and predict sequential datasets by embedding the non-stationary transformer into both the Q-network and the target Q-network. This is particularly relevant for recommendation systems where the goal is to sequentially recommend products on a page, with each recommendation considered an action. The DDQN, enhanced with our transformer, aims to maximise the overall layout's utility, striving for the highest possible number of clicks or transactions through strategic product recommendations based on historical user-item interactions, product characteristics, and user profiles. This enhanced approach allows the DDQN to more accurately anticipate the cumulative rewards associated with different action sequences, optimising the selection of items to present to the consumer at each step. The non-stationary transformer's integration further empowers the DDQN to handle the temporal dynamics and non-stationary nature of recommendation system data, ensuring enhanced performance in environments characterised by rapidly evolving user preferences and interaction patterns.
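A minimal tabular sketch of the DDQN update, with plain dictionaries standing in for the Q-networks (the transformer-based function approximation is omitted):

```python
def ddqn_update(q, q_target, s, a, r, s_next, actions, alpha, gamma):
    """Sketch of the DDQN update rule: the online network q *selects*
    the best next action, while the target network q_target *evaluates*
    it, decoupling selection from evaluation to mitigate overestimation.
    q and q_target are dicts mapping (state, action) -> value."""
    best = max(actions, key=lambda act: q[(s_next, act)])  # selection: online net
    td_target = r + gamma * q_target[(s_next, best)]       # evaluation: target net
    q[(s, a)] += alpha * (td_target - q[(s, a)])           # gradient-free TD step
    return q[(s, a)]
```

In the full model, both dictionaries are replaced by transformer-augmented Q-networks over user-item interaction histories, and the update becomes a gradient step on the squared TD error.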
Integration with DDPG: The augmentation of the DDPG [11] framework with the non-stationary transformer involves strategically embedding this architecture into the actor-network and the critic-network (Figure 3). This integration significantly enhances the model's capacity to interpret and respond to the complex, sequential nature of recommendation system tasks.
In the actor-network, the non-stationary transformer's integration facilitates a more nuanced understanding of the current state, enabling the network to propose actions (e.g., product recommendations) that are not only optimal based on current knowledge but also adaptive to the evolving user preferences and behaviours. The transformer's ability to process temporal sequences and adapt to data shifts allows the actor-network to make more informed decisions, especially in scenarios where user interactions with items change dynamically. Most importantly, it recovers the original non-stationary dataset distribution to reflect the decision-making of the recommendation systems. For the non-stationary transformer integration within the DDPG framework, the actor-network update is defined by the following equation:

∇_{θ^µ} J ≈ E[∇_a Q(s, a | θ^Q)|_{s=s_t, a=µ(s_t)} ∇_{θ^µ} µ(s | θ^µ)|_{s=s_t}], (9)

where ∇_{θ^µ} J represents the gradient of the objective function J with respect to the actor parameters θ^µ. This gradient is estimated as the expected value of the product of the gradient of the action-value function Q with respect to the action a, evaluated at the current state s_t and the action proposed by the current policy µ(s_t), where θ^µ are the parameters of the non-stationary transformer-integrated DDPG actor-network. Equation (9) allows the actor-network to learn optimal policies over time by ascending the gradient of the performance criterion. For the critic-network, including the non-stationary transformer empowers the network to more accurately estimate the future rewards associated with the actions proposed by the actor-network. This is particularly critical in recommendation systems, where the value of an action (e.g., the likelihood of a user clicking on a recommended item) can vary significantly over time. By capturing the temporal dynamics and distributional shifts in user-item interaction data, the critic-network can provide more reliable feedback to the actor-network, leading to continuous policy refinement. In the critic-network of the non-stationary-transformer-integrated DDPG model, the loss function L used for training is defined as follows:

L = E[(r + γ Q(s′, µ(s′ | θ^µ′) | θ^Q′) − Q(s, a | θ^Q))²],

where r is the reward received after executing action a in state s, and γ is the discount factor that weighs the importance of future rewards. The term Q(s′, µ(s′ | θ^µ′) | θ^Q′) is the action-value predicted for the next state s′ by the target policy µ and the target critic-network, parameterised by θ^µ′ and θ^Q′. Meanwhile, θ^Q denotes the parameters of the critic-network.
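The critic loss above can be sketched over a mini-batch, with plain callables standing in for the (transformer-augmented) target actor, target critic, and online critic:

```python
def ddpg_critic_loss(batch, gamma, target_actor, target_critic, critic):
    """Sketch of the DDPG critic loss
    L = E[(r + gamma * Q'(s', mu'(s')) - Q(s, a))^2]
    averaged over a mini-batch of (s, a, r, s_next) transitions.
    The three network arguments are stand-in callables."""
    total = 0.0
    for s, a, r, s_next in batch:
        # TD target: reward plus discounted target-critic value of the
        # target-actor's proposed next action
        y = r + gamma * target_critic(s_next, target_actor(s_next))
        total += (y - critic(s, a)) ** 2  # squared TD error
    return total / len(batch)
```

In training, this scalar loss would be minimised by gradient descent on the online critic's parameters θ^Q, while the target networks are updated slowly (e.g., by Polyak averaging) for stability.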
Incorporating the non-stationary transformer into DDPG preserves DDPG's advantages by offering smoother policy updates and reducing the variance in policy evaluation while significantly enhancing the model's robustness and adaptability. By leveraging the transformer's ability to process non-stationary data, our adapted DDPG framework exhibits superior performance in capturing dynamic user-item interactions and evolving preferences, which are crucial for making precise and contextually relevant recommendations.
Integration with SAC: The SAC [39] algorithm, known for its stability and efficiency in continuous action spaces, employs an entropy-augmented reinforcement learning strategy that encourages exploration by maximising a trade-off between expected return and entropy. Integrating the non-stationary transformer into SAC involves embedding this advanced architecture into both actor networks and critic networks (Figure 4). It enhances their capability to process sequential decision-making tasks by capturing the complex dependencies in user-item interactions. The transformer's ability to handle temporal dynamics and non-stationary data significantly improves the policy's adaptability and the precision of action selection in dynamic recommendation environments. The core of enhanced SAC consists of two actor networks π_θ and π_θ′ and four critic networks, where θ and ϕ denote the parameters of the actor and critic networks, respectively. Including the non-stationary transformer in the four critic networks enables a more nuanced valuation of the state-action pairs, considering the evolving nature of user preferences and item attributes. The objective function for the actor network in enhanced SAC with the non-stationary transformer is given by

J_π(θ) = E_{s_t ∼ D}[ E_{a_t ∼ π_θ}[ α log π_θ(a_t | s_t) − Q_ϕ(s_t, a_t) ] ],

where D is the experience replay buffer, α is the temperature parameter that determines the importance of the entropy term, and s_t and a_t represent the state and action at time t, respectively. This comprehensive understanding of the data's temporal and non-stationary aspects allows for a more accurate estimation of expected returns, facilitating more effective policy updates. The SAC, equipped with the non-stationary transformer, sets a new benchmark for reinforcement-learning-based recommendation systems, particularly in handling the complexities of sequential decision-making and adapting to the dynamic nature of recommendation tasks. Through these strategic fusions, our non-stationary transformer improves the predictive accuracy of reinforcement learning models in recommendation systems. It enhances adaptability and robustness previously unattainable with conventional transformer architectures. This innovative approach promises to redefine the standards of reinforcement-learning-based recommendation systems, accommodating the complex and dynamic nature of real-world user behaviour and preferences.

Datasets
In our experimental analysis, we utilise two distinct datasets tailored to the specific needs of our study: Tenrec [17] for deep learning-based experiments and RL4RS [18] for reinforcement learning-based investigations.
Tenrec Dataset: Derived from Tencent's renowned recommendation platforms, QQ BOW (QB) and QQ KAN (QK), the Tenrec dataset focuses on video recommendations, encapsulating a vast range of user interactions, including clicks, likes, shares, and follows. The QK-video dataset alone boasts over 5 million users and 3.75 million items, resulting in a staggering 142 million clicks, alongside significant volumes of likes, shares, and follows. This extensive dataset, with its diverse feature set covering user demographics and item categories, is anonymised to ensure user privacy. Including both positive and negative feedback provides a holistic view of user preferences, which is pivotal for refining deep learning models within recommendation systems.
RL4RS Dataset: Originating from one of NetEase's games, the RL4RS dataset focuses on reinforcement learning applications in recommendation systems. Comprising two distinct subsets, RL4RS-Slate and RL4RS-SeqSlate, the dataset facilitates the exploration of single-page and sequential slate recommendations to maximise session rewards. The RL4RS-Slate dataset encompasses interactions from 149,414 users across 283 items, resulting in approximately 949 valid item slates per dataset variant, highlighting its complexity and potential for advancing reinforcement learning strategies in recommendation tasks. This comprehensive collection of interaction logs, user behaviours, and item features within the RL4RS dataset offers an unparalleled resource for investigating advanced recommendation strategies and sequential decision-making processes, further enriched by the detailed statistical insights provided.

Deep Learning-Related Experiment
Using PyTorch as the computational framework, we used the Tenrec video dataset for the click-through rate prediction task, dividing it into three subsets: 70% for training, 15% for validation, and 15% for testing. Table 1 summarises these splits. We trained two models on the training set: the baseline BST model and our enhanced version, which incorporates the non-stationary transformer. As depicted in Figure 5, the baseline BST model's loss gradually converges to approximately 0.43, while the non-stationary transformer BST model achieves a more pronounced reduction, converging around 0.39. This difference indicates that our non-stationary transformer BST model attains a lower training loss overall, suggesting enhanced learning efficiency. We then extended this comparative analysis to the test set to verify whether the lower training loss translates into improved prediction accuracy on unseen data.

During the validation phase, we conducted trials to determine the optimal batch size, given the immense scale of the Tenrec video dataset, which contains over a hundred million records. Guided by the Tenrec authors' recommendation of larger batch sizes for datasets of this magnitude, we experimented with batch sizes in powers of two: 1024, 2048, 4096, and 8192. Figure 6 shows that larger batch sizes facilitated a more rapid decrease in loss. However, with the largest batch size of 8192, the loss diminished to near zero, indicating a risk of severe overfitting. Based on these observations, we identified 4096 as the most suitable batch size, balancing efficient learning against the need to avoid overfitting.

On the test set, we report several performance metrics: LogLoss, AUC, and F1 score. We compare our models against the baseline models reported with the Tenrec dataset, namely wide & deep [5], DeepFM [4], neural factorisation machines (NFM) [46], and deep & cross network (DCN) [47]. As Table 2 illustrates, our non-stationary transformer BST model not only matches but significantly surpasses the performance of the baseline BST model, achieving an 8.31% improvement in LogLoss, a 0.81% increase in AUC, and a 2.79% rise in F1 score. Compared with the other benchmark models, it likewise shows a clear advantage, yielding the lowest LogLoss and the highest AUC and F1 scores. Notably, it surpasses the wide & deep model by a substantial margin, with improvements of 14.89% in LogLoss, 0.84% in AUC, and 4.91% in F1 score; similar outperformance holds across all metrics against the DeepFM, NFM, and DCN models. These results underscore the ability of the non-stationary transformer BST model to predict click-through rates effectively, showcasing its strength in handling the complex, dynamic data inherent in recommendation systems. Integrating the non-stationary transformer within the BST framework not only enhanced learning efficiency on the training data but also solidified the model's robustness and accuracy, making it a superior choice for the CTR prediction task in practical applications.
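For reference, the three test metrics can be computed from labels and predicted click probabilities as follows. This is a plain-Python sketch of the standard definitions; in practice one would use library implementations such as scikit-learn's `log_loss`, `roc_auc_score`, and `f1_score`.

```python
import math


def log_loss(y_true, p, eps=1e-15):
    # Binary cross-entropy averaged over samples; probabilities are
    # clipped to avoid log(0).
    total = 0.0
    for y, q in zip(y_true, p):
        q = min(max(q, eps), 1 - eps)
        total += -(y * math.log(q) + (1 - y) * math.log(1 - q))
    return total / len(y_true)


def auc(y_true, p):
    # Probability that a random positive is scored above a random
    # negative (ties count 0.5), computed by brute force for clarity.
    pos = [q for y, q in zip(y_true, p) if y == 1]
    neg = [q for y, q in zip(y_true, p) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def f1(y_true, p, threshold=0.5):
    # Harmonic mean of precision and recall at a fixed decision threshold.
    pred = [1 if q >= threshold else 0 for q in p]
    tp = sum(1 for y, z in zip(y_true, pred) if y == 1 and z == 1)
    fp = sum(1 for y, z in zip(y_true, pred) if y == 0 and z == 1)
    fn = sum(1 for y, z in zip(y_true, pred) if y == 1 and z == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Lower LogLoss and higher AUC/F1 indicate better calibrated probabilities and better ranking/classification quality, which is why the three metrics are reported together.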

Reinforcement Learning-Related Experiment
For our reinforcement learning experiments, we employed the RL4RS-Slate dataset to train models across three distinct reinforcement learning frameworks: DDQN, DDPG, and SAC, representing three classic families of reinforcement learning models. The experiments were conducted in a TensorFlow 1.15 environment. Within each framework, we developed three models: one following the original architecture, a second integrating a transformer layer, and a third enhanced with a non-stationary transformer layer. To ensure comparability, we established a common set of hyperparameters for all models; in particular, each model was trained for 10,000 epochs, and we stored the optimal parameters found during training.

Post-training evaluations were then conducted on the RL4RS-Slate dataset by recording performance across 50 episodes with random starting states for each model variant within the DDQN, DDPG, and SAC frameworks. We collected the maximum and minimum rewards for each episode, as well as the average across episodes. The results, depicted in Figure 7, show the reward distributions for the original models, those with an added transformer layer, and those enhanced with the non-stationary transformer layer. The box plots in Figure 7 show that models incorporating the transformer layer (marked as "2" in each subplot) display a more concentrated range of rewards, suggesting that the transformer layer contributes to more effective learning of the sequential patterns in the dataset. However, this may come at the cost of losing some inherent data characteristics because of the stationarity assumption. In contrast, the non-stationary transformer models exhibit an extended range of maximum and minimum rewards, retaining the data's non-stationary features and indicating their ability to maintain higher reward potential. An additional observation is that the maximum rewards achieved by the DDPG and SAC models frequently surpass 200, a benchmark that only a minority of the DDQN episodes exceed. This outcome is attributable to the policy gradient nature of DDPG and SAC, which enables more effective exploration of optimal strategies than DDQN. Moreover, SAC exhibits the smallest minimum rewards, often between 3 and 5, suggesting that its entropy regularisation promotes exploration but occasionally leads to suboptimal strategies.

We further calculated the mean and standard deviation of average rewards across the 50 episodes for each model variant. As detailed in Table 3, the non-stationary transformer models consistently outperformed the other two variants in average episodic rewards, irrespective of the underlying reinforcement learning framework. Specifically, the non-stationary transformer DDQN model achieved an average reward mean of 96.71 with a standard deviation of 7.72, whereas the corresponding figures were 61.93 (8.35) for the original DDQN and 69.38 (6.84) for the transformer DDQN. This pattern held for the DDPG and SAC models as well: the non-stationary transformer DDPG model achieved an average reward mean of 115.08 with a standard deviation of 8.11, and the non-stationary transformer SAC model reached 100.56 with a standard deviation of 8.71. The observed standard deviations indicate that the non-stationary transformer models achieve higher means while maintaining performance stability close to the best-performing variants within each category. The especially strong DDPG results underscore that framework's inherent strength in maximising rewards, potentially owing to its policy gradient approach, which efficiently navigates the continuous action spaces typical of complex recommendation environments.
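The post-training evaluation protocol can be sketched as a simple rollout loop that rolls a fixed policy through the environment and records per-episode returns. The `env_reset`/`env_step` functions below are illustrative placeholders standing in for the RL4RS-Slate environment API, which we do not reproduce here.

```python
import statistics


def evaluate(policy, env_step, env_reset, episodes=50, max_steps=100):
    """Roll out a fixed policy and summarise per-episode returns.

    `env_reset` yields a (random) starting state, as in our evaluation
    protocol; `env_step(state, action)` returns (next_state, reward, done).
    Both are hypothetical stand-ins for the real environment interface.
    """
    returns = []
    for _ in range(episodes):
        state = env_reset()
        total = 0.0
        for _ in range(max_steps):
            action = policy(state)
            state, reward, done = env_step(state, action)
            total += reward
            if done:
                break
        returns.append(total)
    # The max/min feed the box plots; mean/std feed the summary table.
    return {
        "max": max(returns),
        "min": min(returns),
        "mean": statistics.mean(returns),
        "std": statistics.stdev(returns),
    }
```

Running this once per trained variant yields the reward distributions plotted in Figure 7 and the mean/standard-deviation summaries reported in Table 3.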

Discussion
Our comprehensive experimental analysis encompasses both deep-learning-based and reinforcement-learning-based results for recommendation system algorithms.
Deep Learning-related Experiment: The deep learning experiments conducted using the Tenrec video dataset revealed the enhanced performance of the non-stationary transformer BST model over the baseline BST. Notably, the non-stationary transformer BST model achieved lower LogLoss and higher AUC and F1 scores, indicating superior predictive accuracy and classification quality. This improvement suggests that accommodating the non-stationary aspects of user interaction data can significantly enhance the effectiveness of recommendation systems. Moreover, the observed benefits were consistent across various baseline models, emphasising the robustness and generalisability of our proposed approach.
Reinforcement Learning-related Experiment: In the reinforcement learning domain, the RL4RS-Slate dataset experiments demonstrated the effectiveness of incorporating non-stationary transformer layers into the DDQN, DDPG, and SAC frameworks. The non-stationary transformer models consistently outperformed their standard and transformer-layer counterparts in terms of average cumulative rewards. This outcome underlines the potential of non-stationary transformers in capturing the sequential decision-making nuances necessary for recommendation systems. The DDPG framework, in particular, showed the highest average reward means, suggesting that integrating non-stationary transformers is highly synergistic with policy gradient methods like DDPG.
These experiments collectively underscore the importance of considering temporal dynamics and non-stationarity in user data when designing algorithms for recommendation systems. The findings advocate a paradigm shift toward models that not only process data effectively but also adapt to its evolving nature. The superior performance of the non-stationary transformer-enhanced models suggests that such architectures could be critical in the next generation of recommendation systems, which must deal with increasingly complex user behaviours and preferences. Two properties of the architecture underpin these gains. Firstly, it enables the model to capture and leverage the intrinsic characteristics of the original data distribution more effectively. Secondly, the data transformation performed by the non-stationary transformer introduces controlled noise into the training process; this noise prevents premature convergence and enhances the generalisation ability of the models across diverse datasets. Future research could explore the scalability of non-stationary transformer models on even larger datasets and their adaptability across different domains. Additionally, investigating the interpretability of these models could yield further insight into the complex patterns they learn, potentially guiding the design of even more effective recommendation systems.

Conclusions
In this study, we provide a novel, versatile framework that integrates the non-stationary transformer structure with both deep learning and reinforcement learning, and we present an extensive examination of advanced model architectures for recommendation systems. The non-stationary transformer BST model demonstrated considerable superiority over the baseline BST and other benchmark models in the deep learning experiments. Similarly, when applied to reinforcement learning across DDQN, DDPG, and SAC, the non-stationary transformer-enhanced models consistently yielded higher rewards, indicating their robustness and efficiency in sequential decision-making tasks. The success of these models can largely be attributed to their ability to adapt to the non-stationary nature of user interaction data, capturing temporal dynamics that traditional models often overlook.
Despite these strengths, our study has certain limitations. The computational complexity of non-stationary transformer models is considerable, potentially complicating scalability in larger or distributed machine learning environments. Future work will explore integrating non-stationary transformer structures into a broader range of deep learning and reinforcement learning models and assess the feasibility of deploying these complex architectures in distributed systems for recommendation tasks.

Figure 1. Illustration of the Enhanced BST Model with Non-stationary Transformer Integration.

Figure 2. Illustration of the Enhanced DDQN Model with Non-stationary Transformer Integration.

Figure 3. Illustration of the Enhanced DDPG Model with Non-stationary Transformer Integration.

Figure 4. Illustration of the Enhanced SAC Model with Non-stationary Transformer Integration.

Figure 5. Loss of the different models on the training set.

Figure 6. Loss for different batch sizes on the validation set.

Figure 7. Comparison of maximum and minimum rewards per episode across three reinforcement learning frameworks: DDQN, DDPG, and SAC. For each framework, we compare the original model ("1"), the model with an added transformer layer ("2"), and the model enhanced with the non-stationary transformer layer ("3").

Table 1. Summary of the Tenrec Video Dataset Splits.

Table 2. Performance Comparison of Different Models.

Table 3. Summary of Episode Reward Mean and Standard Deviation for Various Algorithm Implementations.