Article

Novel Models for the Warm-Up Phase of Recommendation Systems

by
Nourah AlRossais
Information Technology (IT) Department, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh 11543, Saudi Arabia
Computers 2025, 14(8), 302; https://doi.org/10.3390/computers14080302
Submission received: 6 June 2025 / Revised: 20 July 2025 / Accepted: 21 July 2025 / Published: 24 July 2025

Abstract

In the recommendation system (RS) literature, a distinction exists between studies dedicated to fully operational (known users/items) and cold-start (new users/items) RSs. The warm-up phase—the transition between the two—is not widely researched, despite evidence that attrition rates are highest for users and content providers during such periods. RS formulations, particularly deep learning models, do not easily allow for a warm-up phase. Herein, we propose two independent and complementary models to increase RS performance during the warm-up phase. The models apply to any cold-start RS expressible as a function of all user features, item features, and existing users’ preferences for existing items. We demonstrate substantial improvements: compared with not handling warm-up explicitly, accuracy-oriented metrics improved by up to 14%, and non-accuracy-oriented metrics, including serendipity and fairness, improved by up to 12%. The improvements were independent of the cold-start RS algorithm. Additionally, this paper introduces a method of examining the performance metrics of an RS during the warm-up phase as a function of the number of user–item interactions. We discuss problems such as data leakage and the temporal consistency of training/testing—often neglected during the offline evaluation of RSs.

1. Introduction

Recommendation systems (RSs) are extensively used on online platforms to help users discover interesting content in vast catalogs of items. An RS is applicable in various contexts, including information retrieval in academic and scientific settings, social media, streaming services, and online commerce. Diverse approaches have been proposed to synthesize an RS, and two paradigms have emerged that encompass most algorithms and approaches: content-based filtering (CBF) and collaborative filtering (CF) [1,2]. Hybrid RSs combine these approaches [3]. Extensive research, especially with the recent proliferation of deep and reinforcement learning applications [4,5], has been dedicated to improving the underlying algorithms that transform user–item metadata and users’ past implicit and explicit item preferences into new recommendations.
Cold start refers to the operating condition where an RS encounters a new user or newly introduced content in the catalog. These scenarios are challenging, and several methods have been developed to address them [6]. Approaches for extreme cold starts focus on providing recommendations to new users or for new items [4,7]. When new users engage with catalog items or new items receive feedback from existing users, the RS is considered to be operating in a “warm-up” condition. In the warm-up phase, we have minimal information with which to reliably adopt a full model; however, these interactions provide valuable insights that should be leveraged to improve on extreme-cold-start recommendations. The warm-up phase is critical for new users, as it shapes their opinion of the system. Therefore, poor performance during this phase affects the retention of users and content creators on the platform. For example, the dataset provided by Amazon—the largest e-commerce platform on the planet—has fewer than 100,000 users who have completed over 50 purchases with reviews [8]. Furthermore, Anand et al. [8] argue that a user profile cannot be formed with fewer than 50 interactions; they therefore filter out users with fewer than 50 purchases and focus on the fully operational RS. The data, even for such large e-commerce sites, reveal that most users are in the warm-up phase, yet sparse research caters to this phase. The same pattern applies to other datasets (MovieLens and Booking.com). Most users are neither in the cold-start phase nor fully characterized; they are most likely in the warm-up phase. In the RS literature, minimal distinction exists between an extreme cold start and the warm-up phase of an RS. In a previous study [9], Zhu et al. recognized the gap between cold-start item and user embeddings (i.e., representations) and the embeddings of a full model and proposed a scaling transition between the two. This is specific to embedding-driven deep learning models [10]. The need for a transition mechanism and the antithetic behaviors of cold and warm recommendations in a deep learning setting have also been observed. A gap exists in the study and improvement of RS performance during the transition from the cold-start phase to the warm-up phase. To corroborate this statement: among the approximately 400 references considered in recent reviews [6,11], only 2 explicitly handled the cold-start-to-warm-up phase, and none analyzed how performance evolved during the warm-up phase. This contrasts with the real-world business needs of RSs. For example, Localytics and CleverTap—two online engagement firms operating in important marketplaces such as eBay, Etsy, and Tesco—reported that most customer attrition on online platforms occurs during the first uses, within weeks of signing up. Datacolor.ai explained that enhancing the metrics of an RS for a global streaming service by 10% led to a 15% reduction in customer churn rates.
Furthermore, RS performance evaluation has evolved considerably in the last few years. The landscape has changed from an accuracy-oriented evaluation—measuring the closeness of predicted ratings and most enjoyed items—to a broader spectrum of metrics, including fairness, partly driven by the negative press received in some cases where RS biases were made public [12]. Fairness is one of the desirable features that the RS research community has recently focused on [13] to minimize the business risk inherent in an RS biasing itself toward accuracy. The fairness/accuracy tradeoff is now a crucial metric that many studies consider when assessing RS performance [4]. Other metrics unrelated to accuracy, such as serendipity and the ability to expose a user to novel parts of a catalog without overspecializing, have received substantial attention [14].
Ji et al. [15] recently discussed a crucial pitfall in the evaluation literature of RSs—data leaks from most offline experimental evaluations. This effect is twofold: i. user recommendations are not chronologically handled (i.e., user‒item interactions are not sequential); ii. experimental training and tests do not account for unavailable user–item interactions when a given interaction is being studied (i.e., future user–item interactions are used to train an RS that is then evaluated on past user–item interactions). Both data leakage effects introduce unrealistic training samples, leading to inaccurate experimental findings. When a study introduces a strategy for splitting the data into training, test, and validation sets without considering time, the authors [15] argue that the findings are affected by data leakage. The widespread lack of time consistency requirements in published research [15] undoubtedly explains why evaluations do not reveal how RS performance metrics are expected to change during the warm-up transition from a cold start to a fully operational RS.
Given the above context, Table 1 presents the research gaps identified in the literature, as summarized below:
  • RS evaluations that explicitly address data leakage are lacking [15].
  • Studies addressing the metrics of the cold-to-warm RS transition as a function of the number of user–catalog interactions remain lacking.
  • Few studies address potential improvements in the cold-to-warm transition via algorithms that explicitly improve the latter phase. The available studies often cater to deep neural network embeddings and are presented for a new user, not a new-item case.

2. Objectives and Contributions

This study aims to bridge the following research gaps introduced in Section 1:
  • We study and present how existing RS formulations perform during the transition from the extreme-cold-start phase to the warm-up phase. In doing so, we ensure that their training and evaluation follow a chronologically consistent pattern, as discussed in Section 1.
  • We propose a novel approach that is independent of the RS algorithm used to improve the performance of any RS during the warm-up phase.
  • We present the performance of baseline RSs with and without the proposed model(s) and demonstrate how their accuracy, fairness, and serendipity metrics evolve on the basis of the growing number of new-user/new-item interactions.
  • Studies have focused on new-user or new-item problems, treating them as two independent cases, with the new-user problem receiving the most attention. Nonetheless, our formulation similarly applies to both problems and reveals desirable characteristics in new-user and new-item cases.
In the remainder of this paper, we discuss related studies that have directly or indirectly addressed the warm-up phase of a recommendation engine and introduce models that can be applied to any RS to increase performance during warm-up, with minimal limitations on the underlying class of RS algorithms. Subsequently, we characterize the incremental effects of specifically catering to the warm-up phase via a set of RS-agnostic approaches and demonstrate that the suggested approaches, with minimal computational cost, can improve on various recommendation metrics during the warm-up phase of an RS.
In this study, we use a dataset that is rich in user‒item interactions and user and item metadata features and includes timestamped interactions, thereby allowing the new-user and new-item experiments to preserve temporal sequences and prevent data leakage. The dataset and a previous study [32] are discussed in Section 5.

3. Related Work

3.1. Cold Start

The cold start of new-user and new-item problems is challenging for any RS, and several methods address such operations [6]. Extreme-cold-start approaches provide recommendations to new users or for new items. In this study, we use the findings of previous studies [4,7] to provide published baselines that offer three crucial features: (i) state-of-the-art performance, (ii) applicability to any RS formulation, and (iii) no restriction to a particular class of solver systems (e.g., factorization-driven or deep neural networks). Recent studies on cold starts [17,18,19] have focused on the new-user problem (not the new-item problem) and on the preference patterns of clustered users to provide recommendations to new users. These studies do not explicitly analyze the evolution of recommendation metrics with the number of user interactions and do not address the data leakage challenge. Two recent reviews on cold start mentioned the warm-up issue in 2 out of over 400 references, and neither addressed data biases from data leakage or offline experiments with inconsistent timelines. This demonstrates the poor recognition of the problems of the cold-to-warm-up transition and data leakage in the current RS literature.

3.2. Warm-Up

In studies on temporal sequences of recommendations, the preference patterns of users are used to generate time- and interaction-history-dependent recommendations. Nonstationary time-dependent effects are introduced via a Bayesian approach or a recurrent or graph neural network. Some studies [24,25] have adopted a Markov transition matrix to introduce temporal patterns in their RSs. Other recent studies [26,27,28] leverage neural networks for sequential recommendation patterns. These studies [24,25,26,27,28] do not explicitly target cold-start-to-warm-up transitions but are concerned with introducing temporal patterns in operational RSs. Nonetheless, such approaches could provide a model for the warm-up phase on the basis of preference patterns mined from previous like-minded users.
An alternative line of research involves the warm-up phase, specifically for neural network-driven RSs, which use embedding layers in neural network architectures. An item embedding layer transforms all items into a vector representation (a latent N-dimensional space) and uses the latent representation of the items to train a neural network. Some studies [9,10] recognized the existence of a gap between cold-start item (user) embeddings and the same embeddings for a fully operational model and proposed scaling transitions [9,20]. They discussed how a good embedding layer substantially improves recommendations for familiar items; however, it suffers from cold- and warm-start initializations [20]. Another study proposed an embedding generator for new items that uses item features to generate an initial embedding rather than relying on a randomly generated start [9]. Both studies focused exclusively on the new-item problem and reported that incorporating incoming item reviews during the warm-up phase produced noisy embeddings, resulting in worse performance than their proposed smoothing approaches.
Neural network-driven approaches to the warm-up phase include applying meta- and reinforcement learning, as discussed in [21]. However, this approach is criticized because it is still limited by a cold start. Moreover, it is affected primarily by poor computational efficiency during warm-up, as Chen et al. [22] reported; therefore, the same authors proposed an alternative tree-based methodology driven by reinforcement learning to address these shortcomings.
A recent study on dynamically adjusting features in the context of a traditional CF approach [23] addressed the temporal change in preferences by allowing some standard features of a CF system to change on the basis of recent user‒item interactions. The latter work did not directly address the cold-start or warm-up phase but shared with [31] the idea of formulating a user–item dynamic feature approach to be incorporated into a cold-start framework—that of stereotypes [4]. In [11,33], the cold-to-warm transition was recognized, and a fusion algorithm was proposed to transform new cold users into warm users.
The major shortcoming of the various methods reviewed, and of the literature reviews addressing cold start, is that most of these studies do not consider the warm-up phase, data leakage, or look-ahead bias as challenges of offline RS evaluations. The small set of studies that recognize some of these problems apply to specific deep learning or reinforcement learning RS formulations and cannot be generalized to others. The explosion of research on neural networks and deep learning models also appears to be accompanied by a reproducibility limitation: as discussed in [34], over 90% of the claims regarding results obtained via such methodologies are not reproducible or lack the information required to reproduce them when other authors attempt recreation. Therefore, we use a deep neural network (DNN)-driven RS as one of the baseline models but are not restricted to only this type, demonstrating how similar results apply to standard, simpler algorithms that interested researchers can easily reproduce.

3.3. RS Evaluation

RS evaluation has evolved considerably over the past decade. Online (or live) tests provide valuable out-of-sample validations by assessing real user feedback. However, these methods lack key advantages such as reproducibility and scalability for fair algorithm comparisons. Only a few large organizations, such as Amazon and Netflix, can conduct statistically significant live evaluations. RS accuracy has long been the main and only evaluation criterion; however, in recent years, other characteristics have garnered attention regarding the performance of RSs. Fairness in RSs is related to under-recommendation biases that emerge during training. Other key properties include serendipity (suggesting unexpected content) and catalog coverage (minimizing undiscoverable content). However, fairness has become a central concern because of its impact on stakeholders and reputational risks [35]. In RSs, fairness is multifaceted and context-dependent [36,37], covering demographic bias (e.g., in loan recommendations) and gender discrimination (e.g., in job candidate RSs).
During the handling of warm-ups and sequences of recommendations, the temporal consistency of the approach and training data is a key feature that has not been discussed in depth in published research [15]. When training RS algorithms, the ratings are often partitioned into training and test sets by sampling criteria that neglect the time order of events or do not recognize that preferences expressed at a future time by a group of users will create a data leak (look-ahead bias) in the RS when it is tested on past leftover reviews [36,38]. As discussed by Ji et al. [15], the overwhelming majority of offline tests presented in the literature to date do not recognize or address this limitation in their experimental evaluations.

4. Model Formulation

4.1. Proposed Models

We approach this section using the same notation as that introduced in the preliminary work [31] and with the symbols summarized in Table 2, where a generic RS can be written in functional form as
$$r_i^u = F\left(I_i, \lambda_i, U_u, \mu_u\right) \tag{1}$$
The liking expressed by user $u$ for item $i$ is represented by a quantifiable explicit or implicit rating $r_i^u$. The RS can be considered a function that expresses $r_i^u$ via the user’s representation $U_u$; the item representation $I_i$; all previous scores $r_i^\omega$ expressed for the same item $i$ by all users $\omega$ other than $u$ (i.e., all available previous ratings expressed for the same item by other users), with their corresponding feature vectors $U_\omega$; and all previous ratings $r_\eta^u$ that user $u$ expressed in the past for other items $\eta$ (where $\eta$ represents the previous items, if any, different from item $i$ rated by user $u$), as well as these items’ representation vectors $I_\eta$. Following [31], the effect of all other ratings from users $\omega$ different from $u$ having previously consumed/rated item $i$ can be condensed and simplified into a characteristic item bias, $\lambda_i$, and the effect of the ratings provided by user $u$ for previous items $\eta$ different from $i$ can be synthesized via a characteristic user bias, $\mu_u$. These biases, which act as subconscious reference points for users or are attributed to items, are commonly introduced in the RS literature [31]. A user bias implies that each user has their own “neutral” scale; on a rating scale of 0 to 10, one user may be inclined to cluster their ratings around 6 and another around 8—such a bias is our $\mu_u$. Similarly, items in the same group (e.g., bicycles) might exhibit different rating biases; for example, city bicycles may cluster around a rating of 6, whereas mountain bikes might receive a neutral rating of around 8. We do not justify why such biases exist but recognize that they have been documented [4,31,39].
In the paragraph above, the word “previous” is important and often overlooked in most of the literature, which omits temporal dependencies and consistency.
The cold-start problem of the new user can be modeled as follows:
$$r_i^u = \phi\left(\tilde{I}_i, \lambda_i, \tilde{U}_u, \bar{\mu}(\tilde{U}_u)\right) \tag{2}$$
In Equation (2), the functional shape differs from that of the general problem, indicating a functional expression that caters to the cold start of the new-user problem. We introduce the encoded item and user representations $\tilde{I}_i$ and $\tilde{U}_u$ to simplify the space of the user and item metadata and provide a denser coordinate system capable of reducing sparsity. In the new-user cold start, given that no prior user-to-item interactions are available for the new user, we introduce $\bar{\mu}(\tilde{U}_u)$ in Equation (2), representing the typical bias of all previously observed users belonging to a user encoding similar to that of the new user under examination. The modeling assumes that a bias typically exists among similarly encoded users toward similarly encoded items and that this average bias is a good proxy for the bias of a new, unknown user. The bias can be imagined as a typical extra like or dislike for a category. For example, one might discover that users who like sci-fi appreciate certain items more, such as Dune or Star Wars, and perhaps others less (e.g., Barbie or Gone with the Wind).
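To make the construction of these biases concrete, the following Python sketch (with illustrative variable names and toy data; not the study’s implementation) estimates a per-user bias $\mu_u$ as the mean deviation of a user’s ratings from the global mean and the group bias $\bar{\mu}(\tilde{U}_u)$ as the average of those biases within each hypothetical user encoding:

import numpy as np

# Toy interaction log: (user_id, item_id, rating); values are illustrative only.
ratings = [
    ("u1", "i1", 6.0), ("u1", "i2", 7.0),
    ("u2", "i1", 8.0), ("u2", "i3", 9.0),
    ("u3", "i2", 5.0),
]
# Hypothetical encoding map: each user belongs to one stereotype group.
user_encoding = {"u1": "sci-fi", "u2": "sci-fi", "u3": "drama"}

global_mean = np.mean([r for _, _, r in ratings])

# Per-user bias mu_u: mean deviation of the user's ratings from the global mean.
user_bias = {
    u: float(np.mean([r for uu, _, r in ratings if uu == u]) - global_mean)
    for u in {u for u, _, _ in ratings}
}

# Group bias mu_bar(enc): average bias over all users sharing an encoding,
# used as the proxy for a brand-new user whose encoding is known (Equation (2)).
group_bias = {
    enc: float(np.mean([b for u, b in user_bias.items() if user_encoding[u] == enc]))
    for enc in set(user_encoding.values())
}
print(user_bias, group_bias)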
Encoding can be performed in various ways. However, discussing, reviewing, or evaluating various encoding approaches is not an objective of this study. Two effective techniques used to demonstrate the role of the proposed models are presented in previous works: the stereotyping of metadata and encoding via simple neural layers [4].
Similarly, the new-item cold-start problem can be written as
$$r_i^u = \psi\left(\tilde{I}_i, \bar{\lambda}(\tilde{I}_i), \tilde{U}_u, \mu_u\right) \tag{3}$$
In Equation (3), when no interactions are available for item $i$, $\bar{\lambda}(\tilde{I}_i)$ represents the typical bias across all items belonging to the same (closest) encoding. For example, jazz songs tend to have a positive average bias among older age groups and a lower average bias among younger age groups.
In Equations (2) and (3), $F(\cdot)$ is replaced with $\phi(\cdot)$ and $\psi(\cdot)$ to indicate that the two problems are handled by two “training runs” of the same RS. In evaluating stereotypes and neural encoding, previous studies [4,7] argued that different functional forms could be used for the cold-start functions, from simple regressions to more complex machine learning algorithms and deep learning models. These choices affect the specific cold-start performance of models (2) and (3). Thus, stereotype-driven encoding increases the performance metrics of the new-user and new-item cold-start problems for a given fixed functional form of the RS.
During the warm-up phase, cold-start models (2) and (3) cannot accommodate the first, second, and $n$-th interactions that become available for the new user and catalog items (or the new item and existing users). Such information is too little to switch away from pure-cold-start models yet too valuable to be disregarded; it constitutes the first hints of precious personalization and, as such, should not be neglected. In a previous study [31], cold-start models (2) and (3) were extended to account for dynamic user and item biases. During the warm-up phase of a new user, when $k$ ratings have been expressed, the $(k+1)$-th consumption of item $i$ can be written as
$$r_i^u(k+1) = \phi\left(\tilde{I}_i, \lambda_i, \tilde{U}_u, \mu_u^{(k+1)}\right), \quad \mu_u^{(k+1)} = (1-\alpha)\,\bar{\mu}(\tilde{U}_u) + \alpha\,\langle \mu_u \rangle_{1,\dots,k}, \quad \alpha = (k/N_u)^{\gamma} \tag{4}$$
In Equation (4), $\mu_u^{(k+1)}$ represents the dynamic bias that adjusts as the user is further personalized. The model is a simple weighted departure from the average bias; that is, when nothing is known about a new user (e.g., when $k = 0$), no personal interaction data are available yet for user $u$, and the model returns $\alpha = 0$; therefore, the user’s bias is entirely derived from the group of similarly represented users, $\bar{\mu}(\tilde{U}_u)$. As the number of interactions increases ($k = 1, 2, \dots, N_u$), the value of $\alpha$ increases and approaches 1. When $\alpha = 0.5$, for example, half of the bias effect is derived from the group of similar users, and half is derived from the previous ratings of user $u$. When $\alpha = 1$, the model states that the user is fully characterized by a bias derived only from previous interactions. The weighting parameters, $N_u$ and $\gamma$, are characteristics optimized during training. The values obtained for these parameters depend on the data and domain application, as different applications may require more or fewer interactions to characterize a user. Optimization is a function of the algorithm used as a solver for the RS, as discussed in Section 4.2.
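For illustration, a minimal sketch of the blending rule in Equation (4) follows (the same function serves Equation (5) with item quantities substituted); the function names and the clamping of $\alpha$ at 1 beyond $N$ interactions are assumptions of this sketch:

import numpy as np

def dynamic_bias(own_bias_history, group_bias, n_cal, gamma):
    """Blend of group bias and personal bias, per Equation (4) (users)
    or Equation (5) (items).

    own_bias_history: bias estimates implied by the k interactions so far.
    group_bias: average bias of the user's (item's) encoding group.
    n_cal, gamma: the trained weighting parameters N_u (N_i) and gamma.
    """
    k = len(own_bias_history)
    # alpha is 0 at cold start and approaches 1 as k -> n_cal; clamping it
    # at 1 beyond n_cal interactions is an assumption of this sketch.
    alpha = min((k / n_cal) ** gamma, 1.0)
    personal = float(np.mean(own_bias_history)) if k > 0 else 0.0
    return (1 - alpha) * group_bias + alpha * personal

# With no interactions, the bias is entirely the group's; it shifts toward
# the personal average as ratings accumulate.
print(dynamic_bias([], group_bias=0.8, n_cal=20, gamma=0.5))            # 0.8
print(dynamic_bias([1.5, 1.2, 1.8], group_bias=0.8, n_cal=20, gamma=0.5))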
The new-item warm-up phase was modeled with a similar concept:
$$r_i^u(k+1) = \psi\left(\tilde{I}_i, \lambda_i^{(k+1)}, \tilde{U}_u, \mu_u\right), \quad \lambda_i^{(k+1)} = (1-\alpha)\,\bar{\lambda}(\tilde{I}_i) + \alpha\,\langle \lambda_i \rangle_{1,\dots,k}, \quad \alpha = (k/N_i)^{\gamma} \tag{5}$$
In Equation (5), $\lambda_i^{(k+1)}$ represents the dynamic adjustment of the bias of item $i$. This model improves item characterization via $\lambda_i^{(k+1)}$ as the number of user reviews $k$ increases. Equation (5) is the equivalent of Equation (4) adopted for the new-item warm-up case. When a new item has not received any ratings ($k = 0$, hence $\alpha = 0$), the bias is derived entirely from similar items based on their encoded representation, $\bar{\lambda}(\tilde{I}_i)$. As the new item receives ratings ($k = 1, 2, \dots, N_i$), the value of $\alpha$ increases progressively. For example, when $\alpha = 0.5$, half of the bias effect is derived from the group of similar items, and half is derived from the previous ratings of item $i$. When $\alpha = 1$, the model states that the item is fully characterized by a bias derived only from previous interactions. The number of interactions $N_i$ is, in principle, different from the number $N_u$ required to characterize a user.
The dynamic biases should not be confused with the bias terms of other formulations, such as those in [20,21]; our terms are corrections that can also be interpreted as biases toward an item or of a user, but they apply to different formulations and are therefore unrelated.
As discussed in [4], the dynamic bias encompasses the personalization of the new user (or the new item) during the warm-up phase. Nonetheless, this effect considers only the history of the user (or item) under consideration; learning from the recommendation/interaction paths of similar users during their warm-up periods has no effect. Herein, we propose a Bayesian-driven contribution to augment and complement the warm-up framework. Following a standard Markov transition approach, as in [29], we introduce a transition matrix $P_{ij}$, where the $i$-$j$ entry refers to the probability of a user interacting with item $i$ after having consumed item $j$. This simply extends well-established statistical approaches [40] to the case of an item-to-item transition. The modeling assumption is that the items recently consumed (rated) by a user contribute to determining the next items that the user may enjoy most, on the basis of patterns followed by similar users. All the items a user interacted with and rated, or a recent subset, can be used to define a probabilistic state vector, $s_j$, representing the known (or recent) user’s taste. In a catalog with $C$ items, the transition matrix $P$ is a $C \times C$ probability matrix, and the state vector is a $C \times 1$ column vector. The probability distribution of a user interacting with item $l$ can be obtained via the product $p_l = P_{lj} s_j$, where the Einstein summation convention over repeated indices has been adopted.

Operational ways of estimating the state vectors and probability transition matrices can be formulated; however, predicting the items most likely to be consumed by a user, as discussed in [29], is not the same as predicting the items that the user would enjoy most. On the basis of this difference between the probability of consuming an item and of enjoying it (via a rating metric), we suggest reformulating the transition approach as follows: Let user $u$ be a new user who has just rated $K$ initial items via $r_k^u$ for $k = 1, \dots, K$. A preference transition matrix $R_{ij}$, similar in construct to matrix $P$, can be obtained from previous user‒item interactions, indicating the expected rating of item $i$ after item $j$ has been consumed, for $i = 1, \dots, C$, where $C$ is the total number of items available in the catalog. The entries of matrix $R$, however, do not represent probabilities. Thus, the preference transition approach can be written as
$$r_i^u(k+1) = \mathcal{N}_k\left(R_{ij}\, r_j^u(k)\right), \tag{6}$$
where the preference transition matrix $R_{ij}$ can be derived via a one-step or multi-step calibration of previously observed transitions. In this study, when building the rows of $R_{ij}$, we focus on a one-step transition: the algorithm groups all users who previously consumed item $j$ and examines the single step (which items were consumed/rated immediately before and after item $j$). Such a formulation can be extended to a multi-step transition; however, this introduces additional complexity, which is outside the scope of the current study. In Equation (6), $r_j^u(k)$ represents user $u$’s rating vector over the previous $k$ items. The function $\mathcal{N}_k(\cdot)$ is a normalization layer allowing the matrix–vector product to be rescaled into the rating space, considering that user $u$ has consumed the first $k$ items (i.e., there are $k$ nonempty entries in the rating vector $r_j^u(k)$). We formulate the problem in Equation (6) by referring to the new-user case to guide the reader; however, the same equation applies to the new-item case, with its own derived transition matrix and normalization function. In the new-item problem, the transition matrix product with the state vector does not concern items but rather existing users. In the new-user case, the transition step is interpreted as “Given an item $i$ consumed by the new user $u$, find and rank the other items that may be relevant to user $u$ by determining what other (established) users $\omega$ consumed before and after item $i$.” The same transition matrix formulation applies to the new-item problem: “Given that user $u$ consumed a new item $i$, find and rank the users for whom item $i$ may be relevant by discovering what other (established) items user $u$ consumed in a neighborhood before and after item $i$, and link those preferences to other users.”
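The sketch below illustrates one way to build a one-step preference transition matrix from time-sorted interactions and to apply Equation (6); the simple averaging of observed next-item ratings and the rating-weighted normalization standing in for $\mathcal{N}_k$ are simplifying assumptions, not the calibrated procedure of this study:

import numpy as np

C = 4  # catalog size
# Time-sorted (user, item_index, rating) triplets from established users.
history = [
    ("a", 0, 8), ("a", 1, 7), ("a", 2, 9),
    ("b", 0, 6), ("b", 2, 8),
    ("c", 1, 5), ("c", 3, 7),
]

# R[i, j]: expected rating of item i observed one step after consuming item j,
# estimated here as a simple average over all users' consecutive pairs.
sums, counts = np.zeros((C, C)), np.zeros((C, C))
by_user = {}
for u, i, r in history:
    by_user.setdefault(u, []).append((i, r))
for seq in by_user.values():
    for (j, _), (i, r_next) in zip(seq, seq[1:]):
        sums[i, j] += r_next
        counts[i, j] += 1
R = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

def predict_warm(rating_vec):
    """Equation (6): R_ij r_j, rescaled into rating space. A rating-weighted
    average stands in for the normalization layer N_k."""
    raw = R @ rating_vec
    return raw / max(rating_vec.sum(), 1e-9)

r_u = np.zeros(C)
r_u[0] = 8.0  # the new user has rated only item 0 so far
print(predict_warm(r_u))  # expected ratings for the remaining catalog items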

4.2. Numerical Implementations of Recommendation Algorithms

Hitherto, we have left the recommendation algorithms generic—the $\phi$ and $\psi$ of Equations (2) and (3)—to highlight that the constructs of Equations (4)–(6) can be applied to different RS algorithms because they constitute an element independent of the recommendation model considered. Previous studies [4,7] revealed that stereotype-driven cold-start systems can outperform other state-of-the-art systems, such as enhanced singular value decomposition (SVD++) [41,42]. In addition, RSs driven by deep neural networks generally outperform traditional machine learning techniques in terms of accuracy. This study aims to demonstrate the impact of the proposed warm-up models independently of the specific RS algorithm, thereby generalizing their effect to the warm-up phase. We focus on investigating how three major classes of RSs (one based on classical machine learning, another based on matrix factorization, and the third based on deep learning) perform during the cold-start-to-warm-up transition and how the proposed models improve their performance. Several other representatives of each RS class could have been selected. We selected the ones discussed below because they encompass most of the categories in the literature and are simple enough to allow the findings to be extrapolated to their representative classes.
We used the following algorithms for the baseline RS to evaluate our models during the warm-up phase:
  • SVD/SVD++ is a well-known baseline model in the RS literature—Svd.
  • The stereotype model with the Xgb solver discussed in [15] represents the best pure-cold-start stereotype-based model in that study—Xgb stereo.
  • The pure-cold-start RS with a DNN is discussed in [4] in its standard form and stereotyped formulation—Dnn, Dnn stereo.
  • Xgb stereo with the dynamic user and item biases of Equations (4) and (5)—Xgb dynamic.
  • DNN (the best deep neural network baseline), with the dynamic user and item biases of Equations (4) and (5) as extra processing layers—Dnn dynamic.
  • XGB dynamic model with the preference transition matrix approach in (6)—Xgb full warm.
  • DNN dynamic model with the preference transition matrix approach in (6)—Dnn full warm.
To train a given RS in the case where the catalog contains C items and U users, these can be stereotyped into E embeddings. Models (4) and (5) require optimizing, in addition to the characteristic parameters of the chosen baseline model, a further $2E$ parameters. Conversely, model (6) requires the further estimation of $E^2$ transition ratings. When training the models, we followed a two-step optimization strategy: we first learned the baseline model for the cold-start problem and then optimized the parameters of models (4), (5), and (6) on the residuals of the baseline model’s predictions. We then retrained the baseline using the parameters identified by the warm-up model, keeping them fixed during in-sample predictions. In our experiments, this approach produced more stable optimization results than the alternative of jointly optimizing a combined loss function for the baseline and warm-up models in a single training pass. For the baseline models, we used stochastic gradient descent with adaptive moment estimation (ADAM) [43], with the tuning strategy in Appendix A (Table A1 and Table A2). Conversely, we used a Bayesian optimization framework for the warm-up models [44,45].
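As an illustration of the second step, the snippet below tunes the warm-up parameters $(N_u, \gamma)$ against a synthetic residual signal; the grid search and the synthetic data stand in for the Bayesian optimization framework and the real residuals used in the study:

import itertools
import numpy as np

rng = np.random.default_rng(0)
k = np.arange(1, 31)                                # warm-up interaction counts
true_alpha = (k / 15.0).clip(max=1) ** 0.7          # hypothetical "true" blending curve
signal = true_alpha + rng.normal(0, 0.05, k.size)   # noisy residual signal

# Step 1 (not shown) fits the cold-start baseline with ADAM.
# Step 2: tune (N, gamma) of alpha = (k / N)^gamma against the residuals.
best, best_err = None, np.inf
for n_cal, gamma in itertools.product([5, 10, 15, 20, 30], [0.25, 0.5, 0.7, 1.0]):
    alpha = (k / n_cal).clip(max=1) ** gamma
    err = np.mean((alpha - signal) ** 2)
    if err < best_err:
        best, best_err = (n_cal, gamma), err
print("best (N, gamma):", best)
# Step 3: the baseline is retrained with these warm-up parameters held fixed.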
Listing 1 presents high-level pseudocode for the experimental pipeline, including the training of base recommender systems and the integration of the warm-up-specific models. The listing captures how the experiment evolves over “time.”
Listing 1. Warm-up-aware recommendation: experimental pipeline. High-level pseudocode outlining the training and evaluation process of base and warm-up recommendation models under cold-start conditions.
# INPUT:
# - Dataset with timestamped user-item interactions + user/item metadata
# - Choice of base RS model: SVD, XGBoost, or DNN
# - Hyperparameters for training

# STEP 0:
# Select a number of user-item interactions deemed sufficient to train a model, M
# Select a number of user-item interactions deemed sufficient to test a model, T
sort all interactions by timestamp
j = 0

for each expanding window R of interactions from 0 to j * M + M:
    # PREPROCESSING STEP 1 (time-consistent training set):
    Identify and retain only the users U and items I whose interactions fall within R
    # PREPROCESSING STEP 2 (encoding):
    Encode identified user and item metadata into embedding vectors (e.g., via stereotypes)

    # ----- NEW USER PROBLEM -----
    # TRAINING Step A (New user case):
    Compute user-group encoded biases μ_enc(u) for each encoding of users in U
    Compute item-group encoded biases λ_enc(i) for each encoding of items in I

    # TRAINING Step B-new_user (New user case):
    Train RS_base_Nu on interactions in R (Equation (2))
    using individual item biases and static encoded user biases μ_enc

    # TRAINING Step C-new_user (New user dynamic warm-up):
    Optimize parameters γ and N_u by minimizing residuals when
    using the trained RS_base_Nu from Step B-new_user, replacing μ_enc with model (4)
    → Obtain RS_base_dyna_warm_Nu

    # TRAINING Step D-new_user (transition probability):
    Optimize the parameters of the item-to-item preference transition matrix
    (Equation (6)) using sorted interactions R to minimize residuals between
    predictions via RS_base_dyna_warm_Nu and the actual ratings
    → Obtain RS_full_warm_Nu

    # ----- NEW ITEM PROBLEM -----
    # Repeat Steps A, B, C, D for the new item case using Equation (5)
    → Obtain RS_base_Ni, RS_base_dyna_warm_Ni, RS_full_warm_Ni

    # ----- INFERENCE -----
    Take the next T time-sorted interactions from (j * M + M) to (j * M + M + T)
    Extract the new users and new items from the T set

    # New user problem:
    Predict ratings and ranked lists of items for all new users in T using
    RS_base_Nu, RS_base_dyna_warm_Nu, RS_full_warm_Nu

    # New item problem:
    Predict ratings and ranked lists of users for all new items in T using
    RS_base_Ni, RS_base_dyna_warm_Ni, RS_full_warm_Ni
    j += 1
Concerning computational complexity, for a catalog with C items, U users, and E embeddings, Table 3 lists the order of complexity of the calculations for each phase. The deep learning model was the most expensive in training and inference, XGBoost was moderately expensive, and the matrix factorization approach was the least expensive.
Table 3 also indicates that the overhead of training the dynamic bias models (Equation (4) for the new-user problem and Equation (5) for the new-item problem) is linear in the encoding/embedding size (E), a minimal addition compared with the training cost of the base recommender system. Training the transition matrix model (Equation (6)) is quadratic in E but remains below the complexity of training the base models, particularly the DNN and XGBoost, and is comparable only to SVD. Thus, the proposed models are computationally efficient and well suited for integration into existing RS pipelines at minimal added cost.

4.3. Model Applicability

In Section 4.1 and Section 4.2, we present the model formulation, along with its computational complexity and offline experimental implementation. In this subsection, we outline the practical applicability of the proposed models and their potential deployment benefits.
The central motivation behind our formulations is to improve performance during the cold-start-to-warm-up transition phase. This challenge has been widely recognized in the literature: as discussed in the introduction, there is substantial evidence of user and content-provider attrition on new platforms. New users often abandon systems after only a few interactions if they are unable to discover relevant content, while new-item providers (e.g., sellers) may leave platforms if their items do not receive fair exposure or engagement.
Table 4 highlights representative domains and use cases where cold-start or warm-up challenges are especially relevant. It maps each scenario to the corresponding components of the proposed method and summarizes the expected outcomes in each case.

5. Experiments

5.1. Data

The ideal dataset for illustrating models (4), (5), and (6) should possess several key properties. First, interactions should be timestamped to ensure a comprehensive record of the platform’s evolution, including the arrival of new users and items; this temporal granularity is crucial to preventing data leakage during training and evaluation, as discussed in Section 1. Second, the dataset should contain abundant metadata, ideally for both users and items, to allow for richer vector representations. Finally, accessibility is key, and the dataset should be publicly available to facilitate the reproducibility of the research findings.
To the author’s knowledge, very few publicly available datasets fulfill all the stated requirements. One such dataset used in this evaluation is the MovieLens/IMDb integrated dataset [32]. The MovieLens data are the subject of many RS studies; however, integrating such data with the IMDb features has been addressed in only a handful of articles. In addition to the author’s previous work [7], two other such studies are [46,47]. A second dataset, used to validate the high-level findings, is the one from Amazon.com, in particular its Sports and Outdoors sub-dataset [48].
Table 5 summarizes the key characteristics of both datasets, including the number of users, items, and ratings, as well as the number of user and item features. The encoding procedure follows the automatic stereotype generation approach described in our previous work [7], to which we refer interested readers for further details.
Depending on the cut-off thresholds used for encoding numerical features, the automatic stereotype procedure yields metadata embeddings with dimensionality ranging from 250 to 500 for the MovieLens/IMDb-enriched dataset, mostly driven by the encoding of item features [32]. For the Amazon dataset, the encoded dimension is more modest, ranging from 75 to 100.
When ranking the importance of stereotyped features, we observed in our previous study [7] that the top three item features for the MovieLens dataset were Genre, Budget, and Movie Popularity. These dominated over other attributes such as Cast, Cast Popularity, Production Year, Production Country, Runtime, View Count, etc. For user features, Gender and Age group were the most influential, significantly outweighing factors like Location and Occupation. In contrast, the Amazon dataset exhibited less pronounced dominance of a small set of features. Among item features, only Category and Price Range showed a modest dominance over other features. User features in the Amazon data appeared more evenly distributed in importance, without clear dominance.
We build on the results of [4,7], which demonstrated that stereotype-based encodings for these datasets produced better standardized recommendation models (in terms of accuracy) compared with models using either the raw native features or neural network–based encodings. As this sensitivity analysis has already been thoroughly explored in previous research, we consider feature selection and encoding strategy to be outside the scope of the current study.

5.2. Cold-Start Warming Experiments

The characteristic biases and imbalances in some of the resulting categories (e.g., documentaries or Western languages have far fewer ratings than other classes, and foreign-language users are a minority compared with English-language users) are preserved and used to determine some of the fairness and serendipity characteristics discussed in later sections.
A set of users was withheld from model (2) training and used as a test set for the new-user pure-cold-start experiment (applying cross-validation across all users in the database). Additionally, the catalog of items is “purified” of the items the new users would have been unable to rate because those items had not yet been released at the time the users placed their first reviews. In an extreme cold start, the model produces recommendations and ranked lists for a new user in the test set, assuming that the user has not rated any items. The generated recommendations are subsequently evaluated against the user’s actual preferences via a range of metrics, including consumption/nonconsumption, rating values, and ranking accuracy. Pure-cold-start experiments on the new item were conducted similarly. A group of items is excluded from the model (3) “fit” (applying a cross-validation approach across all items in the database), and all user‒item interactions that concern the excluded items are blanked out. Similarly, when training the new-item model, all users who came online on the platform after the release of the new item were excluded to ensure time consistency (i.e., the new item is rated only by users who were on the platform at the time of its release). For each new item in the test set, the model predicts the users most likely to consume it, as well as their expected ratings. The accuracy, consumption, and ranking accuracy can be derived by comparing the predictions with the held-out users who interacted with the item.
Starting from the pure-cold-start experiments, further constraints are imposed in the warm-up phase. When a user (item) is excluded to represent a new user (new item), the temporal consistency discussed in Section 1 must be addressed during the experiments. First, following [15], we retained only the triplets (user, content, and rating) expressed before the new user (new item) came online to train models (2) and (3), preventing information from leaking backward in time. Second, the temporal order of the triplets must be retained to train and evaluate models (4), (5), and (6).
This is because the dynamic biases must evolve by accumulating reviews in the same chronological order as the ratings placed by the new user (or given to the new item). In addition, the transition step depends on the state formed by the few initial reviews given by the new user (or received by the new item). To the author’s knowledge, this evaluation is more thorough than those presented in the literature, where data are typically gathered irrespective of their temporal characteristics.
To highlight the personalization effects of the proposed models, we present several metrics as functions of the interaction count $W$, i.e., the number of ratings expressed by the new user (received by the new item) during the warm-up phase. When assessing the performance metrics at $W$, only the following interactions, $W+1, W+2, \dots, W+N$, are used for evaluation. In the context of the analyzed dataset, with the filtering restrictions discussed in the previous paragraph, the average user has over 130 reviews; conversely, the average item has been reviewed over 200 times. Therefore, the warm-up phase for our experiments ranged from 0 (pure cold start) to approximately 30 ratings (warm-up). These cut-offs, which can be determined via the optimization procedure of the model, vary with the characteristics of each dataset used.
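The evaluation loop can be sketched as follows: for one new user’s chronologically ordered ratings, the model state at interaction count $W$ sees only the past, and scoring uses only the interactions that follow. The constant predictor and the fixed evaluation horizon below are placeholders:

import numpy as np

def warmup_curve(user_ratings, predict_fn, w_max=30, horizon=10):
    """user_ratings: time-sorted list of (item, rating) pairs for one new user.
    predict_fn(history, item): predicted rating given only the first W
    interactions. Returns the RMSE observed at each warm-up count W."""
    curve = []
    for w in range(0, min(w_max, len(user_ratings) - 1)):
        history = user_ratings[:w]               # only the past is visible
        future = user_ratings[w:w + horizon]     # score on what follows
        sq = [(predict_fn(history, item) - r) ** 2 for item, r in future]
        curve.append((w, float(np.sqrt(np.mean(sq)))))
    return curve

# Illustrative use with a trivial constant predictor:
demo = [(f"i{n}", 5 + (n % 3)) for n in range(40)]
print(warmup_curve(demo, lambda history, item: 6.0)[:3])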

6. Results

6.1. Accuracy

The accuracy of single recommendations is a major metric in RS research; however, it is now regarded as insufficient on its own because it disregards the fact that users are more interested in recommendation lists and that accurate recommendations at different positions in a ranked list have different values for stakeholders. We focus on the accuracy metrics of actual ranked lists in the following subsections. Here, we present simple accuracy because it remains a widespread metric in the RS literature.
The accuracy of single recommendations is customarily measured via the root mean squared error (RMSE) [16,36]. Figure 1 displays the accuracy of the predicted ratings in the new-user and new-item experiments as a function of the number of warm-up reviews, $W$. The figure also illustrates the confidence intervals of the estimated values at a p-value of 0.05 (95% confidence). The pure-cold-start baselines (Svd, Xgb stereo, and Dnn stereo) do not exhibit marked improvement or degradation in their RMSE as $W$ increases for the new-user case (the case most studied in the literature). Only Dnn stereo shows minimal degradation as $W$ increases; nonetheless, it remains the best-performing pure-cold-start baseline algorithm under this metric. All pure-cold-start baselines tend to degrade slowly with $W$ because once the recommendations for the dominant stereotypes have been made, improving on the most popular items of the given user stereotype without personalization effects is difficult if not impossible.
The same pure-cold-start baselines display minimal improvement for the new-item case as a function of $W$. At least four distinct and crucial facts can be inferred from the results of Figure 1. First, the personalization effects of the dynamic and fully warm algorithms introduced in Section 4 induce a significant improvement in recommendation performance for the new-user and new-item experiments, with the improvement growing rapidly at low $W$ and tapering off to an asymptotic value toward the end of the warm-up count. Second, the improvement obtained was independent of the RS algorithm used in the experiments. The RSs driven by Xgb and Dnn have distinctive cold-start base-prediction abilities; however, the warm-up bias models (4) and (5) and the preference transition model (6) are similarly beneficial to both functional forms and across experiments. This suggests that the proposed models provide a direction for improving different RS formulations. Third, the improvement due to the transition matrix approach in (6) is more pronounced in the new-user case (independent of the RS chosen). In the new-item case, the extra performance boost of model (6) (fully warm) compared with models (4) and (5) (dynamic) can be considered secondary. This finding may be linked to the fourth interesting observation, concerning model (1): the new-user and new-item cases behave differently. As the number of items rated by a new user increases, it becomes more challenging for the pure-cold-start model to provide recommendations. This is because the pure-cold-start model relies on well-known items, which are highly rated across users and generally make good recommendations early in the user’s history. Personalization during warm-up for the new-user case therefore improves the RMSE (accuracy) despite the degrading pure-cold-start accuracy. The earliest reviews for the new-item case are the most challenging to predict for both the cold-start and warm-up models. Thus, dynamic personalization improves the accuracy as $W$ increases, with warm-up models increasing the accuracy gain across different recommendation algorithms.

6.2. Ranking Accuracy

Ranking accuracy metrics [7] arise because users browse ranked lists of suggestions. Moreover, a ranked list is more valuable when it has more “hits” (content that the RS suggests and the user selects) and when the selected items are liked more. Similarly, in the new-item problem, new content introduced on the platform is recommended to some existing users, and a good RS suggests a ranked list of users who will find the new item interesting. Ranking accuracy metrics measure the usefulness of a recommended list for the stakeholder. Several metrics exist for this purpose, and [7] provides a panoramic overview. In this context, we focus on two of the most popular measures: the hit rate (HR) and the Normalized Discounted Cumulative Gain (NDCG). The HR provides an overview of the hits in a ranked list and is among the simplest ranking accuracy metrics; however, it is not an effective way to compare the ranked lists of different RSs. The NDCG defines how valuable a ranked list is on the basis of (i) the content of the ranked list consumed by the user (hits), (ii) the ranking of such content (the higher the ranking of a hit, the higher the NDCG), and (iii) the rating of such hits (the higher the rating expressed for a hit placed higher in the list, the higher the NDCG). A previous study provides details on the calculation [7]. The HR is one of the simplest ranked-list metrics, whereas the NDCG is one of the most sophisticated; evaluating these two metrics therefore covers a broad spectrum and strengthens the findings.
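For reference, minimal computations of the HR and NDCG over a top-N list are sketched below, under the common convention that a hit at (1-based) rank p with rating r contributes r / log2(p + 1); the exact gain definition of this study follows [7] and may differ:

import math

def hit_rate(ranked, consumed):
    """Fraction of the recommended list that the user actually consumed."""
    return sum(item in consumed for item in ranked) / len(ranked)

def ndcg(ranked, consumed_ratings):
    """consumed_ratings: {item: rating} for the items the user consumed.
    The gain of a hit at rank p (1-based) is rating / log2(p + 1),
    normalized by the ideal ordering of the same ratings."""
    dcg = sum(consumed_ratings.get(it, 0) / math.log2(p + 1)
              for p, it in enumerate(ranked, start=1))
    ideal = sorted(consumed_ratings.values(), reverse=True)[:len(ranked)]
    idcg = sum(r / math.log2(p + 1) for p, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

top_list = ["a", "b", "c", "d"]        # illustrative recommendations
prefs = {"b": 9, "d": 7, "z": 8}       # illustrative consumed items with ratings
print(hit_rate(top_list, prefs), ndcg(top_list, prefs))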
In our experiments, we ran several top-N lists (top 10, 20, and 30) and observed that the length of the ranked list, while affecting the overall metric values, did not affect the conclusions that could be drawn regarding the relative importance and effects of the various algorithms. Therefore, we focus on one ranked list, which is the top 10 items recommended at W . Notably, the ranked list content is dynamic with W for two reasons: First, the new interactions available for the user (with the new item) affect the models (Xgb and Dnn with warm-up dedicated models that are dynamic and fully warm). Second, a user at interaction W may hit an item that needs to be removed from the ranked list presented at W + 1 . This effect is often neglected in other studies where static ranked lists are examined.
Figure 2 illustrates the HR for all the models during the warm-up period. The primary finding is the significant improvement that the transition preference matrix approach in (6) brings to the quality of the ranked lists. For simple accuracy, the transition preference matrix approach provided only a marginal RMSE improvement over the dynamic bias warm-up components of models (4) and (5); for the ranked lists, we observed the opposite. The dynamic biases are marginally beneficial to the list hit rates, whereas considerable benefits come from the preference transition matrix model. While investigating this effect, we observed that the shallow CF effect of model (6) enables the ranked lists to draw on more items of the catalog than they would if the model were used only to predict the ratings and biases of users/items. In addition, the extra content tends to be most valuable after the first few interactions of the user with the catalog, leading to maximum improvements of approximately 25% in the HR. As the review count increases, the RS’s ability to produce valuable hits experiences minimal attrition, as depicted by the HR and by the tendency of the NDCG to gradually decrease with $W$, discussed later. Independently of the algorithm used, it becomes more challenging to recommend valuable content in the new-user and new-item problems as the $W$ interactions grow because the “most recurrent patterns of recommendations” have already been used. Another finding stemming from the RMSE and HR graphs is that the standard SVD-driven technique substantially underperforms all other models in accuracy and ranked-list hits. Therefore, we removed it from the following results to make the plots more readable.
When analyzing the NDCG of the top-N lists, we observed behavior similar to that of the HR, with two distinctive characteristics visible in Figure 3 and Figure 4. The HR peaked later than the NDCG ($W$ of 10–15 for the HR maximum versus 5 for the NDCG maximum), and the difference between the dynamic-only models and the fully warm model was greater than that for the HR. An investigation of these two minor differences revealed that they could be attributed to the ranking of the items. Transition matrix model (6) injects novel and meaningful content into the ranked lists owing to its collaborative effect. This content is indeed consumed; however, these new hits rank lower in the top-N as $W$ increases, explaining why the NDCG peaks earlier and why model (6) improves the HR more than the NDCG compared with models (4) and (5).
Our accuracy and ranking accuracy metric results reveal that models (4), (5), and (6) complement each other in the warm-up phase. The former models (4) and (5) lead to heightened accuracy during warm-up on the basis of fast user and item personalization; conversely, the latter model, model (6), increases the likelihood of “hits” in the top-N recommendations as interactions are recorded and W increases.

6.3. Serendipity and Fairness

Over the last few years, other aspects of RS performance beyond accuracy have gained attention. Serendipity and fairness are two RS qualities that are loosely related but address two different aspects. Serendipity refers to the ability of the recommendation algorithm to surprise the user with unexpected items, items that reveal other areas of the catalog, and items that the user may not have intentionally discovered [13]. An RS optimized to maximize accuracy will likely exhibit overspecialization and an inability to provide users with items from other catalog branches, thus failing to exhibit serendipitous traits.
The fairness of recommendations, a relatively new evaluation concept for RSs, has gained attention in recent years [12,14]. This evaluation aims to reduce the reputation risk for the RS host by not skewing recommendations to users based on their traits and not impairing the visibility of content providers. Item providers should be confident in the fairness of the RS such that their products will receive sufficient visibility and will not be over-ranked unfairly. An unfair RS (one that develops patterns that overly favor certain items or item providers) constitutes a reputation and lawsuit risk, such as the challenges faced by Amazon [12], and may drive content providers away from a commercial platform.
Regarding serendipity, there is no agreement in the literature on a unified metric comparable to those available for ranking accuracy. Many definitions have been proposed, most of them tailored to the needs of specific datasets. A previous study [7] proposed a proxy for serendipity that can be adapted to any dataset in which metadata features are available. In particular, all metadata features can be stereotyped (i.e., discretized/clustered), whether numerical or complex categorical (a complex categorical variable may assume many labels in a non-predefined sequence). The first few important features can be selected via a simple feature-importance ranking tool (e.g., Lasso regression); however, focusing on one or several features does not change the definition of serendipity, only its computed values, and makes the selection arbitrary. A proxy for the serendipity of a ranked list can then be computed as the ratio of all distinct, discrete entries for the selected important features to the union of all possible entries in the catalog. In this context, the serendipity metric chosen is the same as that in the prior study [7] because, to the best of the author’s knowledge, it is the only definition that can be applied across domains to the new-user and new-item problems while providing values comparable to those obtained previously [7].
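Reduced to its essentials, the proxy is a coverage ratio over discretized feature values; a sketch with hypothetical feature labels follows:

def serendipity_proxy(ranked_lists, item_features, catalog):
    """Ratio of the distinct discretized feature values appearing in the
    recommended lists to all values present in the catalog, following the
    coverage-style proxy of [7]."""
    recommended = {v for lst in ranked_lists for it in lst
                   for v in item_features[it]}
    possible = {v for it in catalog for v in item_features[it]}
    return len(recommended) / len(possible)

# Discretized values of a few important features (e.g., genre, budget band);
# the labels below are illustrative only.
features = {
    "i1": {"sci-fi", "budget:high"},
    "i2": {"drama", "budget:low"},
    "i3": {"documentary", "budget:low"},
}
print(serendipity_proxy([["i1", "i2"]], features, features.keys()))  # 0.8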
Figure 5 and Figure 6 display the serendipity metric (total discretized features as a percentage of total catalog coverage) during warm-up for the top-N ranked lists computed at different W. Previous works [4,7,18] revealed a higher serendipity metric for the new-user case than for the new-item case, at least for the dataset under examination. In such a dataset, the lower serendipity of the new-item case can be viewed as a consequence of the algorithm overweighting prediction accuracy during training. The model treats new items with infrequent labels as isolated vectors in the I space, and the algorithms are discouraged from placing such niche items (such as documentaries in the case of the movie dataset) within the top-N lists, which would improve serendipity but potentially damage accuracy. The trends seen in the accuracy and ranking accuracy metrics also apply to the serendipity of ranked lists during the warm-up phase. Combining models (4), (5), and (6) improved the serendipity of the ranked lists by up to 15% during the warm-up phase compared with the cold-start baselines. Furthermore, the serendipity of the DNN family of algorithms is lower than that of Xgb, the first indication of accuracy-related overspecialization for a particular RS, which becomes even clearer when fairness is examined.
When measuring recommendation fairness between stakeholder groups, researchers have typically defined group membership using one identifying feature (e.g., genre, category, or budget). Fairness is then characterized by the difference in recommendation distributions among groups, via the first few moments of the distribution or an integral version such as the Gini coefficient [49]. A previous study [4] introduced a fairness metric derived from the definition in [13], focusing on the least recommended items in the union of the top-N lists, that is, on the left-hand tail of the recommendation distributions between groups and a control group (or the full population). This operative definition of fairness suits the current challenge for two reasons. First, by taking the union of the top-N lists recommended by the baseline models and by the baseline models augmented with warm-up, we can examine stakeholder fairness in the various cases without selecting a specific feature such as genre or gender, as is done in most studies. Second, the definition applies to both user and item fairness; most studies consider user-based fairness, and very few findings are available on the fairness experienced by content providers (new items).
The operative definition of our fairness metric consists of isolating the tail portion t% of the content available in the union of the top-N lists and calculating the Mean Discounted Gain (MDG) for item $i$ at relevance cutoff $N$:

$$\mathrm{MDG}_i = \frac{1}{|U|}\sum_{u \in U}\frac{\delta(z_{i,u} \le N)}{\log(1 + z_{i,u})}. \quad (7)$$

In Equation (7), $z_{i,u}$ represents the rank of content $i$ in the ranked list suggested to user $u$; $\delta(x)$ is a delta function with a value of 1 if $x$ is true and 0 otherwise. We can obtain a metric of fairness by taking a statistic (e.g., the mean) of $\mathrm{MDG}_i$ across all users $u$ and the items $i$ in the bottom t% tail of the least recommended:

$$FR_N = \left\langle \mathrm{MDG}_i \right\rangle_{L_{UI}^N}, \quad (8)$$

where $\langle \cdot \rangle$ is a statistical operator such as the sample mean, $L_{UI}$ is the size of the list of all unique items in the union of all the top-N lists recommended to users $U$, and $N$ is the number of top items recommended. A higher fairness value indicates a fairer RS, that is, one where the bottom t% of recommended items has a higher probability of being present in some ranked lists and more such items appear at better positions. The quantity defined in Equation (8) can be considered a proxy for the probability that an item in the tail of the least recommended items (t%) is included in the top-N lists.
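Under these definitions, Equations (7) and (8) can be computed directly from the ranked lists. The sketch below (illustrative names; the natural logarithm is assumed for the discount) is one possible implementation:

```python
import numpy as np

def fairness_at_tail(ranked_lists, n=10, tail=0.15):
    """FR_N (Equation (8)): average MDG_i (Equation (7)) over the
    tail% least recommended items in the union of top-n lists.
    `ranked_lists` maps each user id to their full ranked item list."""
    users = list(ranked_lists)
    union = {i for u in users for i in ranked_lists[u][:n]}
    mdg = {}
    for i in union:
        total = 0.0
        for u in users:
            top = ranked_lists[u][:n]
            if i in top:                       # delta(z_iu <= n)
                z = top.index(i) + 1           # 1-based rank of item i
                total += 1.0 / np.log(1.0 + z)
        mdg[i] = total / len(users)            # Equation (7)
    tail_size = max(1, int(tail * len(mdg)))
    worst = sorted(mdg.values())[:tail_size]   # least recommended items
    return float(np.mean(worst))               # Equation (8), sample mean
```

Setting n = 10 and tail = 0.15 corresponds to the $FR_N$ at 15% configuration reported in Figure 7.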
We show fairness results from the content provider's point of view. Figure 7 presents the fairness metric of the ranked lists for the least recommended 15% of items ($FR_N$ at 15%) in the new-user problem. Analyzing this metric allowed us to answer the following question: "Are there item groups that the recommendation algorithm systematically treats unfairly by not giving them sufficient exposure across the ranked lists presented to new and warm-up users?" Our findings indicate that the most significant factor contributing to fairness is the recommendation algorithm itself; any effects of the warm-up models, while improving the fairness footprint, constitute only second-order effects. The same pattern is observed for the new-item problem (user-based fairness).
The DNN-based models exhibited the greatest ability to specialize and provided the most accurate recommendations; however, this specialization comes at the cost of lower serendipity and substantially lower fairness (compared with an alternative machine learning model). Fairness and accuracy thus appear to be antithetical quantities: the more a researcher attempts to improve accuracy and ranked-list accuracy, the more such improvement may erode the fairness of the RS, consistent with other recent findings on the accuracy/fairness tradeoff [50,51].
We conclude this part with a scatter plot of relative improvements over the baseline cold-start models for accuracy and serendipity (Figure 8). Presented in this fashion, the data highlight two crucial findings of this study. First, the warm-up-based models provide significant improvements over the base cold algorithms during the warm-up phase. Second, each baseline model displays its own characteristic footprint in the accuracy/variety tradeoff. The DNN algorithm's improvements advance mostly below the diagonal line y = x, whereas the Xgb-driven algorithm's improvements lie above it. This is further evidence that DNN models tend to specialize, providing accurate recommendations with lower variety that the proposed models can improve; conversely, Xgb improves serendipity at least as much as accuracy, occupying a different position in the accuracy/serendipity tradeoff.

6.4. Extending to a New Dataset

To further validate the behaviors observed in our main experiments, we extend the analysis to a second dataset: the Amazon.com Sports & Outdoors subset. This step serves as a preliminary confirmation of the findings established using the MovieLens + IMDb data.
Table 6 presents the primary metrics for the new-item problem with a single baseline recommender. We selected this configuration to limit computational overhead, given the scale of the Amazon dataset and its resource demands. Although this setup is narrower in scope, it suffices to verify the robustness of our method on a distinct dataset with different structural characteristics in a different real-life domain (streaming vs. e-commerce). The results (Figure 9) show that ranking accuracy, serendipity, and their tradeoff follow patterns consistent with those observed in the previous experiments. Moreover, the behavioral similarity to the new-user results (Figure 8) indicates that the core findings of Section 6.1 to Section 6.3 generalize beyond the original dataset.

6.5. Result Summary

We summarize all our findings from Section 6 as follows:
  • Adding models (4), (5), and (6) enables better warm-up performance under various metrics compared with the baseline and pure-cold-start models.
  • Accuracy and ranking accuracy show consistent improvements (approximately 10%), particularly in the early interaction phases, where personalization starts to emerge.
  • Serendipity gains: the warm-up models increase serendipity by 5–15% over the baseline recommenders, especially during early interactions.
  • The performance improvement is obtained regardless of the base RS utilized, and it is statistically significant.
  • The bias models (4) and (5) and the transition model (6) provide two almost independent directions of improvement: better user–item characterization and collaborative recommendation paths.
  • The tradeoff between accuracy and serendipity/variety appears to be a characteristic feature of the baseline selected.
  • The preliminary results are reproduced on a second dataset.
  • The added complexity of the warm-up models is linear (bias models) or quadratic (transition models), remaining modest relative to base RS training, especially for deep models.

7. Conclusions and Future Research

7.1. Conclusions

In RS research and applications, the operating regimes of extreme-cold-start and fully warm recommendations are treated with two distinct approaches or with a single approach that sacrifices performance in one regime. The transition between them is often neglected or relegated to a marginal problem, with most researchers focusing on fully operating RS performance. This is despite the fact that Localytics, a major application analytics and engagement firm working with brands such as ESPN and eBay, noted that most customer attrition on online platforms occurs during the first two weeks of use, and CleverTap (Tesco and Etsy) reports that churn rates are highest for new users and content providers in online marketplaces. Improving recommendation engines during the cold-start-to-warm transition is therefore a neglected area with clear practical benefits.
In this study, we introduced novel methodologies to improve the performance of any RS during the transition from the cold-start to the warm-up phase. Through a complexity analysis, we demonstrated that the two models presented carry parsimonious training and inference overheads compared with standard RS algorithms, requiring only an ad hoc optimization strategy and one extra inference step per submodel in addition to the existing RS.
We draw the following conclusions based on the results of the series of experiments:
  • Standard RSs without any special treatment of the warm-up phase may substantially lag behind a warm-up-dedicated, sequence-aware approach such as the ones presented; in our experiments, such treatment improved key metrics by 10% to 15%. Given that the warm-up phase is likely when new users form their opinion of the platform, including such treatment in commercial applications is crucial.
  • Although the deep neural network-driven RS achieved the highest accuracy, its baseline fairness was very low in our experiments. Recommendation platforms adopting such approaches may need to sacrifice some accuracy to achieve better fairness characteristics. To date, limited research has addressed this topic.

7.2. Limitations and Future Work

While this work proposes an effective strategy to improve recommendation quality in the cold-start and warm-up phases, several limitations and opportunities for further development remain:
  • Model personalization depth. While we use stereotype-based encodings and dynamic biasing, future enhancements could integrate real-time interaction signals.
  • Deployment dynamics. Our evaluation is based on offline experiments. A promising future direction involves online evaluation via A/B testing to assess long-term user retention and engagement. At this stage, we can only extrapolate expected gains from ratios observed in real-world case studies, such as those discussed by Datacolor.ai, where an improvement of up to 10% in recommendation engine performance reduced customer attrition by 15%. By analogy, we anticipate that the proposed approach could reduce attrition during the critical warm-up phase by approximately 30%.
  • Multi-objective optimization. We focus on accuracy and examine fairness as a byproduct of the models produced, but balancing and injecting goals such as fairness, diversity, and novelty during warm-up RS training remains an open challenge for further research.

Funding

This study was supported by the Ongoing Research Funding program (ORF–2025–1269), King Saud University, Riyadh, Saudi Arabia.

Data Availability Statement

This study used the MovieLens/IMDb integrated dataset described in [32].

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CF: Collaborative filtering
CBF: Content-based filtering
CTR: Click Through Rate
MF: Matrix factorization
SVD: Singular value decomposition
SVD++: Singular value decomposition Plus Plus
NDCG: Normalized Discounted Cumulative Gain
RMSE: Root mean square error
MAE: Mean absolute error
HR: Hit rate
AUC: Area Under the Curve
MAP: Mean Average Precision
MRR: Mean Reciprocal Rank
UI: User–item
IMDb: Internet Movie Database
ML: MovieLens
NLP: Natural Language Processing
DL: Deep learning
SGD: Stochastic gradient descent
ADAM: Adaptive moment estimation
GPU: Graphics Processing Unit
SOTA: State of the Art
I/O: Input/Output
OOV: Out of Vocabulary (can apply in cold-start contexts)
RQ: Research Question
R&D: Research and Development

Appendix A

Table A1. DNN tuning strategy.
Hyperparameter | Grid
Learning Rate (lr) | {0.001, 0.002, 0.005}
Batch Size | {256, 512, 1024}
Epochs | Max 100, early stopping (patience = 5–10)
Optimizer (Adam) β1 | 0.9
Optimizer (Adam) β2 | 0.999
Weight Decay (L2) | {1 × 10^−4, 1 × 10^−3}
Dropout Rate | {0.1, 0.2, 0.3}
Hidden Layers | {3, 5, 7}
Neurons per Layer | {128, 256, 512}
Table A2. Xgb tuning strategy.
Hyperparameter | Grid
Learning Rate (eta) | {0.025, 0.05}
Number of Trees (n_estimators) | {250, 500}
Max Depth (max_depth) | {6, 9}
Subsample (subsample) | {0.6, 0.8, 1.0}
Column Sampling (colsample_bytree) | {0.25, 0.5, 0.75}
L1 Regularization (alpha) | {1 × 10^−3}
L2 Regularization (lambda) | {1 × 10^−3}
Min Child Weight (min_child_weight) | {3, 5}
Gamma (gamma) | {0.1, 0.2}
Early Stopping Patience | {10, 20}
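For reproducibility, the sweep in Table A2 can be expressed, for example, with the scikit-learn wrapper of XGBoost. The sketch below uses the canonical parameter names corresponding to the aliases in the table and synthetic stand-in data (the real experiments use the user–item feature matrices described earlier), so it should be read as a template rather than the exact tuning script:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import ParameterGrid, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40))        # stand-in for user + item feature vectors
y = rng.uniform(1.0, 5.0, size=2000)   # stand-in for explicit ratings

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

grid = ParameterGrid({
    "learning_rate": [0.025, 0.05],        # eta
    "n_estimators": [250, 500],
    "max_depth": [6, 9],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.25, 0.5, 0.75],
    "reg_alpha": [1e-3],                   # L1 regularization (alpha)
    "reg_lambda": [1e-3],                  # L2 regularization (lambda)
    "min_child_weight": [3, 5],
    "gamma": [0.1, 0.2],
})

best_score, best_params = np.inf, None
for params in grid:                        # exhaustive sweep (288 configurations)
    model = xgb.XGBRegressor(early_stopping_rounds=10,
                             eval_metric="rmse", **params)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    if model.best_score < best_score:      # validation RMSE at the best iteration
        best_score, best_params = model.best_score, params
```

The {10, 20} early-stopping patience in the table would be swept analogously by moving early_stopping_rounds into the parameter dictionary.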

References

  1. Lu, J.; Wu, D.; Mao, M.; Wang, W.; Zhang, G. Recommender system application developments: A survey. Decis. Support Syst. 2015, 74, 12–32. [Google Scholar] [CrossRef]
  2. Ko, H.; Lee, S.; Park, Y.; Choi, A. A survey of recommendation systems: Recommendation models, techniques, and application fields. Electronics 2022, 11, 141. [Google Scholar] [CrossRef]
  3. Çano, E.; Morisio, M. Hybrid recommender systems: A systematic literature review. Intell. Data Anal. 2017, 21, 1487–1524. [Google Scholar] [CrossRef]
  4. Al-Rossais, N.A. Improving cold start stereotype-based recommendation using deep learning. IEEE Access 2023, 11, 145781–145791. [Google Scholar] [CrossRef]
  5. Afsar, M.M.; Crump, T.; Far, B. Reinforcement learning based recommender systems: A survey. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  6. Panda, D.K.; Ray, S. Approaches and algorithms to mitigate cold start problems in recommender systems: A systematic literature review. J. Intell. Inf. Syst. 2022, 59, 341–366. [Google Scholar] [CrossRef]
  7. AlRossais, N.; Kudenko, D.; Yuan, T. Improving cold-start recommendations using item-based stereotypes. User Model. User-Adap Inter. 2021, 31, 867–905. [Google Scholar] [CrossRef]
  8. Anand, A.; Johri, P.; Banerji, A.; Gaur, N. Product Based Recommendation System on Amazon Data. Int. J. Creat. Res. Thoughts IJCRT 2020. [Google Scholar]
  9. Zhu, Y.; Xie, R.; Zhuang, F.; Ge, K.; Sun, Y.; Zhang, X.; Lin, L.; Cao, J. Learning to warm up cold item embeddings for cold-start recommendation with meta scaling and shifting networks. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; ACM: New York, NY, USA, 2021; pp. 1167–1176. [Google Scholar] [CrossRef]
  10. Chen, H.; Wang, Z.; Huang, F.; Huang, X.; Xu, Y.; Lin, Y.; He, P.; Li, Z. Generative adversarial framework for cold-start item recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; ACM: New York, NY, USA, 2022; pp. 2565–2571. [Google Scholar] [CrossRef]
  11. Yuan, H.; Hernandez, A.A. User cold start problem in recommendation systems: A systematic review. IEEE Access 2023, 11, 136958–136977. [Google Scholar] [CrossRef]
  12. Kodiyan, A.A. An Overview of Ethical Issues in Using AI Systems in Hiring with a Case Study of Amazon’s AI Based Hiring Tool. Res. Prepr. 2019, 12, 1–19. [Google Scholar]
  13. Zhu, Z.; Kim, J.; Nguyen, T.; Fenton, A.; Caverlee, J. Fairness among new items in cold start recommender systems. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), Virtual, 11–15 July 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 767–776. [Google Scholar] [CrossRef]
  14. Alhijawi, B.; Awajan, A.; Fraihat, S. Survey on the objectives of recommender systems: Measures, solutions, evaluation methodology, and new perspectives. ACM Comput. Surv. 2023, 55, 93. [Google Scholar]
  15. Ji, Y.; Sun, A.; Zhang, J.; Li, C. A critical study on data leakage in recommender system offline evaluation. ACM Trans. Inf. Syst. 2023, 41, 1–27. [Google Scholar] [CrossRef]
  16. Wu, L.; He, X.; Wang, X.; Zhang, K.; Wang, M. A survey on accuracy-oriented neural recommendation: From collaborative filtering to information-rich recommendation. IEEE Trans. Knowl. Data Eng. 2022, 35, 4425–4445. [Google Scholar] [CrossRef]
  17. Panteli, A.; Boutsinas, B. Addressing the cold-start problem in recommender systems based on frequent patterns. Algorithms 2023, 16, 182. [Google Scholar] [CrossRef]
  18. Patro, S.G.K.; Mishra, B.K.; Panda, S.K.; Kumar, R.; Long, H.V.; Taniar, D. Cold start aware hybrid recommender system approach for e-commerce users. Soft Comput. 2023, 27, 2071–2091. [Google Scholar] [CrossRef]
  19. Xu, Y.; Wang, E.; Yang, Y.; Xiong, H. GS-RS: A generative approach for alleviating cold start and filter bubbles in recommender systems. IEEE Trans. Knowl. Data Eng. 2023, 36, 668–681. [Google Scholar] [CrossRef]
  20. Pan, F.; Li, S.; Ao, X.; Tang, P.; He, Q. Warm up cold-start advertisements: Improving ctr predictions via learning to learn id embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; ACM: New York, NY, USA, 2019; pp. 695–704. [Google Scholar] [CrossRef]
  21. Vartak, M.; Thiagarajan, A.; Miranda, C.; Bratman, J.; Larochelle, H. A meta-learning perspective on cold-start recommendations for items’. In NeurIPS; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6904–6914. [Google Scholar]
  22. Chen, H.; Zhu, C.; Tang, R.; Zhang, W.; He, X.; Yu, Y. Large-scale interactive recommendation with tree-structured reinforcement learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 4018–4032. [Google Scholar] [CrossRef]
  23. Behera, G.; Nain, N. Collaborative filtering with temporal features for movie recommendation system. Procedia Comput. Sci. 2023, 218, 1366–1373. [Google Scholar] [CrossRef]
  24. Rendle, S.; Freudenthaler, C.; Schmidt-Thieme, L. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; ACM: New York, NY, USA, 2010; pp. 811–820. [Google Scholar] [CrossRef]
  25. Wen, W.; Wang, W.; Hao, Z.; Cai, R. Factorizing time-heterogeneous Markov transition for temporal recommendation. Neural Netw. 2023, 159, 84–96. [Google Scholar] [CrossRef]
  26. He, M.; Lin, J.; Luo, J.; Pan, W.; Ming, Z. FLAG: A feedback-aware local and global model for heterogeneous sequential recommendation. ACM Trans. Intell. Syst. Technol. 2023, 14, 1–22. [Google Scholar] [CrossRef]
  27. Gao, C.; He, X.; Gan, D.; Chen, X.; Feng, F.; Li, Y.; Chua, T.S.; Yao, L.; Song, Y.; Jin, D. Learning to recommend with multiple cascading behaviors. IEEE Trans. Knowl. Data Eng. 2021, 33, 2588–2601. [Google Scholar] [CrossRef]
  28. Chang, J.; Gao, C.; Zheng, Y.; Hui, Y.; Niu, Y.; Song, Y.; Jin, D.; Li, Y. Sequential recommendation with graph neural networks. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; ACM: New York, NY, USA, 2021; pp. 378–387. [Google Scholar] [CrossRef]
  29. Ahmed, A.; Salim, N. Markov Chain Recommendation System (MCRS). Int. J. Novel Res. Comput. Sci. Softw. Eng. 2016, 3, 11–26. [Google Scholar]
  30. Quadrana, M.; Cremonesi, P.; Jannach, D. Sequence-aware recommender systems. ACM Comput. Surv. 2019, 51, 1–36. [Google Scholar] [CrossRef]
  31. AlRossais, N. Warming up From Extreme Cold Start Using Stereotypes with Dynamic User and Item Features. In Proceedings of the ACM International Conference on Recommender Systems (RecSys 2023), KaRS, Singapore, 18–22 September 2023. [Google Scholar]
  32. AlRossais, N.A.; Kudenko, D. Isynchronizer: A Tool for Extracting, Integration and Analysis of Movielens and IMDb Datasets. In Proceedings of the Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization, Singapore, 8–11 July 2018; ACM: New York, NY, USA, 2018; pp. 103–107. [Google Scholar] [CrossRef]
  33. Li, P.; Chen, R.; Liu, Q.; Xu, J.; Zheng, B. Transform cold-start users into warm via fused behaviors in large-scale recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; ACM: New York, NY, USA, 2022; pp. 2013–2017. [Google Scholar] [CrossRef]
  34. Ferrari Dacrema, M.; Boglio, S.; Cremonesi, P.; Jannach, D. A troubling analysis of reproducibility and progress in recommender systems research. ACM Trans. Inf. Syst. 2021, 39, 1–49. [Google Scholar] [CrossRef]
  35. Jin, D.; Wang, L.; Zhang, H.; Zheng, Y.; Ding, W.; Xia, F.; Pan, S. A survey on fairness-aware recommender systems. Inf. Fusion 2023, 100, 101906. [Google Scholar] [CrossRef]
  36. Zangerle, E.; Bauer, C. Evaluating recommender systems: Survey and framework. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  37. Deldjoo, Y.; Jannach, D.; Bellogin, A.; Difonzo, A.; Zanzonelli, D. Fairness in recommender systems: Research landscape and future directions. User Model. User Adapt. Interact. 2024, 34, 59–108. [Google Scholar] [CrossRef]
  38. Jeunen, O. Revisiting offline evaluation for implicit-feedback recommender systems. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; ACM: New York, NY, USA, 2019; pp. 596–600. [Google Scholar] [CrossRef]
  39. Chen, J.; Dong, H.; Wang, X.; Feng, F.; Wang, M.; He, X. Bias and debias in recommender system: A survey and future directions. ACM Trans. Inf. Syst. 2023, 41, 67. [Google Scholar] [CrossRef]
  40. Levin, D.A.; Peres, Y.; Wilmer, E.L. Markov Chains and Mixing Times; AMS: Providence, RI, USA, 2009. [Google Scholar]
  41. Cao, J.; Hu, H.; Luo, T.; Wang, J.; Huang, M.; Wang, K.; Wu, Z.; Zhang, X. Distributed design and implementation of SVD++ algorithm for e-commerce personalized recommender system. In Proceedings of the Embedded System Technology: 13th National Conference, ESTC 2015, Beijing, China, 10–11 October 2015; Revised Selected Papers 13. Springer: Singapore, 2016; pp. 30–44. [Google Scholar]
  42. Frolov, E.; Oseledets, I. HybridSVD: When collaborative information is not enough. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; ACM: New York, NY, USA, 2019; pp. 331–339. [Google Scholar] [CrossRef]
  43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  44. Frazier, P.I. A tutorial on Bayesian optimization. arXiv 2018, arXiv:1807.02811. [Google Scholar] [CrossRef]
  45. Balandat, M.; Karrer, B.; Jiang, D.; Daulton, S.; Letham, B.; Wilson, A.G.; Bakshy, E. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 21524–21538. [Google Scholar]
  46. Rana, A.; Bridge, D. Explanations that are intrinsic to recommendations. In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, Singapore, 8–11 July 2018; ACM: New York, NY, USA, 2018; pp. 187–195. [Google Scholar] [CrossRef]
  47. Barkan, O.; Koenigstein, N.; Yogev, E.; Katz, O. CB2CF: A neural multiview content-to-collaborative filtering model for completely cold item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; ACM: New York, NY, USA, 2019; pp. 228–236. [Google Scholar] [CrossRef]
  48. He, R.; McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, Montréal, QC, Canada, 11–15 April 2016; pp. 507–517. [Google Scholar]
  49. Wang, Y.; Wang, Y.; Ma, W.; Zhang, M.; Liu, Y.; Ma, S. A survey on the fairness of recommender systems. ACM Trans. Inf. Syst. 2023, 41, 1–43. [Google Scholar] [CrossRef]
  50. Ge, Y.; Zhao, X.; Yu, L.; Paul, S.; Hu, D.; Hsieh, C.C.; Zhang, Y. Toward Pareto efficient fairness-utility trade-off in recommendation through reinforcement learning. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual, 21–25 February 2022; ACM: New York, NY, USA, 2022; pp. 316–324. [Google Scholar] [CrossRef]
  51. Rahmani, H.A.; Deldjoo, Y.; Tourani, A.; Naghiaei, M. The unfairness of active users and popularity bias in point-of-interest recommendation. In Proceedings of the International Workshop on Algorithmic Bias in Search and Recommendation, Stavanger, Norway, 10 April 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 56–68. [Google Scholar]
Figure 1. RMSE during warm-up phase for new users and items. RMSE (the lower, the better) as a function of the number of interactions during warm-up for all models. Dynamic bias models (Equations (4) and (5)) significantly outperform cold-start baselines (SVD, XGB stereo, and DNN stereo), while the transition matrix model (Equation (6)) offers additional gains. Error bars indicate 95% CI (p = 0.05).
Figure 2. Hit rate performance during warm-up for new users and items. Hit rate (HR; the higher, the better) as a function of warm-up interaction count across all models. Dynamic bias models (Equations (4) and (5)) yield notable gains over cold-start baselines, while the transition matrix model (Equation (6)) delivers the largest performance improvement during early interactions. Error bars reflect 95% CI (p = 0.05).
Figure 3. NDCG improvement during warm-up phase for new users. Normalized Discounted Cumulative Gain (NDCG) as a function of warm-up interaction count for cold-start baselines (XGB stereo and DNN stereo) and warm-up models. Dynamic and fully warm models deliver consistent gains of 5–10% over baselines. Performance plateaus as popular items are consumed, with the marginal gains from the fully warm model becoming statistically less significant at higher interaction counts. Error bars indicate 95% CI (p = 0.05).
Figure 4. NDCG during warm-up phase for new items. Normalized Discounted Cumulative Gain (NDCG) as a function of warm-up interaction count for baseline models (XGB stereo and DNN stereo) and warm-up-enhanced models. Dynamic and fully warm models show stable improvements of approximately 5–8%, particularly at lower interaction counts. The additional gain of the fully warm model over the dynamic models is near the statistical confidence threshold. Error bars reflect 95% confidence intervals (p = 0.05).
Figure 5. Serendipity during warm-up for new users. SER (the higher, the better) scores as a function of warm-up interaction count for baseline models (XGB stereo and DNN stereo) and warm-up models. Warm-up approaches yield statistically significant gains of 5–15% over baselines, with the greatest relative improvement occurring early in the warm-up phase. The higher accuracy of DNN-based models corresponds to lower serendipity compared with XGB. Error bars reflect 95% CI (p = 0.05).
Figure 6. Serendipity during warm-up for new items. SER (the higher, the better) scores as a function of the warm-up interaction count W across base and warm-up models. The warm-up models statistically improve serendipity over the simpler cold-start baselines by 5 to 20%—the improvement increases over the entire warm-up period. The stronger accuracy-related performance of the DNN model group is not reflected in its serendipity, which is lower than that of Xgb (error confidence levels at p = 0.05; 95% CI).
Figure 7. Item-based fairness during warm-up for new users. Fairness score as a function of warm-up interaction count for cold-start baselines (XGB stereo and DNN stereo) and warm-up models. Improvements in fairness from warm-up models are secondary to the choice of RS algorithm. XGB demonstrates inherently higher fairness than DNN owing to its lower specialization. Error bars indicate 95% confidence intervals (p = 0.05).
Figure 8. Accuracy/serendipity tradeoff. Relative improvements in accuracy and serendipity over the pure-cold-start models for the new-user experiment. The proposed warm-up models provide a clear direction for improvement in terms of accuracy and variety during the warm-up phase. Each model retains its characteristic variety/accuracy tradeoff, with DNN and Xgb improving below and above the diagonal line, respectively.
Figure 9. Relative improvements in accuracy and serendipity over the pure-cold-start models for the new-item experiment. Also shown is the y = x line.
Table 1. Literature review matrix: cold-start and warm-up coverage in RS research. Overview of the few studies dedicated to cold start that recognize and address the warm-up phase as a major problem, treat training/testing splits without data leakage, respect time constraints, and explore bias, as explained in a previous study [15].
Columns: warm-up transition addressed/mentioned | warm-up performance addressed | data leakage or look-ahead bias addressed in training | time event consistency addressed in experiments.
Rows: literature reviews on RSs [1,2,3,5,16] (over 200 refs); literature reviews on cold start [6,11]; cold start [17,18,19], plus 270 references in [6,11]; cold-start studies [9,20,21,22,23]; cold-start studies [4,7,15,24,25,26,27,28,29,30]; cold-start studies [31]. (The per-cell coverage marks are graphical in the original table.)
Table 2. Summary of mathematical notation used throughout the manuscript. Definitions of key symbols and variables used in model formulations.
$F(\cdot), \phi(\cdot), \psi(\cdot)$ | Functional form of the RS and its specifications for the new-user and new-item problems.
$r_{i,u}$ | Implicit or explicit rating of user $u$ for item $i$.
$U_u, I_i$ | Representation vectors of user $u$ and item $i$ via their metadata coordinates.
$\tilde{U}_u, \tilde{I}_i$ | Encoded representations of user $u$ and item $i$, for instance, via stereotypes.
$\mu_u, \lambda_i$ | Characteristic biases of user $u$ and item $i$.
$\bar{\mu}(\tilde{U}_u), \bar{\lambda}(\tilde{I}_i)$ | Average bias exhibited by the users (items) of an encoding neighbor.
$K$ | Total number of observed interactions of a user (with an item) in the system.
$k$ | The $k$th interaction of a user (with an item) in the system.
$\mu_u^{k+1}, \lambda_i^{k+1}$ | Dynamic model for the user (item) bias at the $k$th interaction.
$\langle \mu_u \rangle_{1,\dots,k}$ | Average user bias recorded for user $u$ during the first $k$ interactions.
$\langle \lambda_i \rangle_{1,\dots,k}$ | Average item bias recorded for item $i$ during the first $k$ interactions.
$N_u, N_i$ | Number of interactions required to characterize a user (item).
$\gamma$ | Dynamic weight decay (parameter computed during optimization).
$P_{i,j}$ | Probability of interacting with (consuming) item $i$ after interacting with (consuming) item $j$.
$s_j$ | State vector at $j$ for a user $u$, representing recent interactions with the catalog.
$R_{i,j}$ | Rate transition matrix of the expected rating of item $i$ after rating item $j$.
Table 3. Computational complexity of base and warm-up models. Estimated training and inference complexity for base models and warm-up components, including dynamic bias models (Equations (4) and (5)) and the transition model (Equation (6)), expressed in terms of catalog size C, user base U, embedding size E, and training iterations I.
Model | Training base | Training models (4), (5) | Training model (6) | Inference
DNN (L layers, N nodes) | O(I(U+C)E((U+C)E + LN²)) | O(ILN²E) | O(ILN²E²) | O(UCELN²)
XGBoost (T trees, depth D) | O(IT((U+C)E log[(U+C)E + 2D])) | O(ITDE) | O(ITDE²) | O(UCETD)
SVD | O(I((U+C)E)²) | O(IE) | O(IE²) | O(UCE)
Table 4. Practical use cases for warm-up models across application domains. Real-world scenarios highlighting how the proposed dynamic bias and transition models (Equations (4)–(6)) address cold-start and warm-up challenges in recommender system deployment.
Domain | Use case | Suggested model | Problems targeted | Key outcome
E-commerce | Product recommendations for new users | RS_base + Models (4), (6) | New-user attrition, e.g., Amazon-reported attrition rates | Better warm-up characterization and more sales
E-commerce | Cold-launch of new products | RS_base + Models (5), (6) | Attrition rate of new sellers, e.g., Etsy ghost sellers | Improved exposure and early item ranking fairness
Streaming | New-user onboarding for media platforms | RS_base + Models (4), (6) | Improved satisfaction of new users, e.g., Netflix | Early engagement without explicit preferences
News platforms and research feeds | Article recommendations for new readers | RS_base + Models (4), (6) | Low CTR | Timely personalization for readers
Social media | Suggestions for new users or content | RS_base + Models (4), (5), (6) | Low user engagement and low content exposure | Onboarding of users, warm-up of their new content
Table 5. Dataset statistics and metadata dimensions.
Dataset | No. of users | No. of items | No. of ratings | No. of user features | No. of item features
MovieLens + IMDb | 3827 | 6040 | 1,000,290 | 5 | 35
Amazon Sports & Outdoor | 478,000 | 532,197 | 3,268,700 | 3 | 9
Table 6. Performance metrics for the new-item experiment, warm-up, and dataset effect. Typical 95% CIs: RMSE 0.0039 (MLens + IMDb), 0.0013 (Amazon); MAE 0.0037 (MLens + IMDb), 0.0009 (Amazon); NDCG 0.0097 (MLens + IMDb), 0.0022 (Amazon); SER 0.017 (MLens + IMDb), 0.014 (Amazon).
W | Dataset | Model | RMSE | MAE | NDCG_5 | NDCG_10 | SER_10
5 | MLens + IMDb | Stereo | 0.8876 | 0.613 | 0.5093 | 0.5103 | 0.4719
5 | MLens + IMDb | Fully Warm | 0.8795 | 0.581 | 0.5274 | 0.5324 | 0.5506
5 | Amazon S & O | Stereo | 0.7730 | 0.524 | 0.3805 | 0.3896 | 0.2317
5 | Amazon S & O | Fully Warm | 0.8044 | 0.501 | 0.4031 | 0.4099 | 0.2654
10 | MLens + IMDb | Stereo | 0.8853 | 0.611 | 0.4995 | 0.5009 | 0.4785
10 | MLens + IMDb | Fully Warm | 0.8737 | 0.577 | 0.5266 | 0.5313 | 0.5684
10 | Amazon S & O | Stereo | 0.7693 | 0.531 | 0.3814 | 0.3830 | 0.2388
10 | Amazon S & O | Fully Warm | 0.7995 | 0.509 | 0.4049 | 0.4103 | 0.2697
15 | MLens + IMDb | Stereo | 0.8817 | 0.608 | 0.4915 | 0.4913 | 0.4813
15 | MLens + IMDb | Fully Warm | 0.8671 | 0.571 | 0.5269 | 0.5302 | 0.5544
15 | Amazon S & O | Stereo | 0.7615 | 0.529 | 0.3895 | 0.3855 | 0.2395
15 | Amazon S & O | Fully Warm | 0.7930 | 0.499 | 0.4105 | 0.4096 | 0.2702