Article

An Industrial Framework for Cold-Start Recommendation in Few-Shot and Zero-Shot Scenarios

School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
* Author to whom correspondence should be addressed.
Information 2025, 16(12), 1105; https://doi.org/10.3390/info16121105
Submission received: 22 October 2025 / Revised: 5 December 2025 / Accepted: 12 December 2025 / Published: 15 December 2025

Abstract

With the rise of online advertising, e-commerce industries, and new media platforms, recommendation systems have become an essential product form that connects users with a vast number of candidates. A major challenge in recommendation systems is the cold-start problem, where the absence of historical interaction data for new users and items leads to poor recommendation performance. We first analyze the causes of the cold-start problem, highlighting the limitations of existing embedding models when faced with a lack of interaction data. To address this, we classify the features of models into three categories, leveraging the Trans Block mapping to transfer features into the semantic space of missing features. Then, we propose a model-agnostic industrial framework (MAIF) with the Auto-Selection serving mechanism to address the cold-start recommendation problem in few-shot and zero-shot scenarios without requiring training from scratch. This framework can be applied to various online models without altering the prediction for warm entities, effectively avoiding the “seesaw phenomenon” between cold and warm entities. It improves prediction accuracy and calibration performance in three cold-start scenarios of recommendation systems. Finally, both the offline experiments on real-world industrial datasets and the online advertising system on the Dazhong Dianping app validate the effectiveness of our approach, showing significant improvements in recommendation performance for cold-start scenarios.

1. Introduction

With the rise of short videos, e-commerce platforms, and new media platforms, recommendation systems have become an essential product form that connects users with a large number of candidates. They aim to attract users, increase user retention and stickiness, and improve conversion rates, thereby supporting the company’s business goal of continuous growth. Recommendation systems work well and provide personalized recommendations because they draw on rich historical interaction information. However, current recommendation systems still have some problems that deserve attention. Firstly, the observed data in almost all recommendation systems is highly sparse, i.e., only a few entries of the user–item interaction matrix are known [1]. Secondly, new users and items continue to emerge in real-world systems, often without any historical interaction data. These factors culminate in the fundamental challenge of serving users and items that lack sufficient interaction history [2], a scenario formally defined as the cold-start problem [3,4]. The problem describes the system’s difficulty in making accurate predictions for new entities, which manifests as both a few-shot challenge (for entities with minimal data) and a zero-shot challenge (for entities with no data).
Consider the problems that a recommendation system with a serious cold-start problem would encounter. (1) Decreased user experience: The lack of sufficient historical interaction data for new users or items leads to generic, irrelevant, or less diverse recommendations. The system struggles to understand new users’ interests and preferences, resulting in user churn. (2) Attrition of content creators: On short-video and new-media platforms, content creators cannot gain enough attention and exposure, which reduces their motivation to continue creating content on the platform and leads to a vicious circle. (3) Limited marketing effectiveness: For e-commerce platforms or advertising systems, if the recommendation system cannot accurately recommend new products or content, product sales and revenue may suffer. Similarly, for advertisers, if the system fails to accurately target ads to the intended users, marketing effectiveness will be limited.
Recently, the development of embedding techniques that generate compact, low-dimensional representations of features has significantly enhanced the accuracy of click-through rate (CTR) prediction and achieved state-of-the-art performance in the industry [5,6]. In a sense, however, embedding-based models may aggravate the cold-start problem for the following reasons: (1) Dependence on interaction data: The embedding-based model relies on interaction data between users and items to learn the latent factors. In cold-start situations, new users or items have little to no interaction data, so the model was not exposed to similar data during training and struggles to learn accurate and meaningful embeddings for them. (2) Sensitivity to initialization: In many embedding-based models, embeddings for users and items are initialized randomly before training. During online serving, however, the embedding of a newly encountered user or item is usually a zero vector. This makes the predictions in cold-start scenarios even more uncontrollable. Thus, some dropout-based models have been proposed to mitigate the cold-start problem [7,8]. (3) Sensitivity to data distribution: The embedding-based model is sensitive to the distribution of the training data [5,9]. In cold-start situations, the available interaction data is often sparse and biased, which further aggravates the problem.
From the data perspective (i.e., the amount of historical interaction information), the cold-start problem can be divided into two categories: few-shot and zero-shot scenarios. The basic idea for the few-shot scenario is to use more of the available information. For example, some works obtain additional side information about users and items [10,11,12] or map warm features to cold features that have sparse interactions [3,13,14,15]. Other works modify the model structure to reduce the influence of cold features, such as dropout models [7]. However, dropout relies on the assumption that cold features are missing at random, and it drops both warm and cold features simultaneously. Hence, it can lead to the seesaw phenomenon [5], a trade-off between cold-start and warm-up entities: aggressively optimizing for cold-start entities often degrades the performance of warm-up entities, and vice versa. This conflict arises because strategies effective for cold-start entities are typically inapplicable or suboptimal for warm-up entities. Since warm-up entities typically constitute the majority of the data and traffic, any performance degradation on this side creates considerable resistance and hinders the implementation of cold-start-specific optimizations. Moreover, more complex models and strategies, which rely heavily on data fitting and understanding, are more sensitive to uninitialized features. Although this may not always manifest as a loss in accuracy, the resulting bias in predictions can be particularly detrimental. In practice, we observe that cold-start optimization in industrial scenarios often becomes a zero-sum game. This is primarily due to the difficulty in reconciling short-term and long-term metrics, as well as the differing perspectives and responsibilities of algorithm engineers and product or operations teams. These conflicts typically manifest as one party’s reluctance to accept declines in the metrics it deems important, while the other party focuses on different objectives.
In the zero-shot scenario, recent studies attempt to transform the cold features into the same semantic representation as warm features to overcome the cold-start problem, with methods such as sharing [16] or mapping [17].
An industrial recommendation system needs an efficient, high-quality solution to the cold-start problem. Such a solution faces the following challenges: (1) How can the framework be applied to various online models at a relatively low cost? Many online models have been accumulating data day after day, and experimental models retrained from scratch may struggle to match their performance or may require substantial offline resources. Therefore, it would be very attractive if the proposed architecture could support hot-plugging. (2) How can the seesaw phenomenon be avoided when using the framework? We hope to solve the cold-start problem for user or item entities without harming the warm entities. In short, we want to make up for the shortcomings of the recommendation system during the data accumulation period while ensuring fairness between cold and warm entities. (3) How can strategies be switched dynamically during online inference, so that both few-shot and zero-shot scenarios are covered? Many existing works ignore this question. Due to differences in product scale, model update routines, or changes in request distribution, whether an entity is cold or warm is not static, and the judgment needs to be interpretable. Therefore, efficient switching is necessary for successful implementation in industry and for avoiding the seesaw phenomenon.
Considering the above challenges, we propose a novel industrial framework for cold-start recommendation in few-shot and zero-shot scenarios. The contributions of this paper are as follows:
  • We propose a model-agnostic industrial framework (MAIF) that establishes a global semantic mapping from attribute features to the cold-start feature field. It can be applied to various online embedding-based models without altering the existing organization of training samples, significantly improving prediction accuracy and calibration performance in cold-start scenarios.
  • We design a non-invasive optimization strategy based on parameter reuse and gradient isolation. This approach enables MAIF to be “hot-plugged” into continuous training pipelines without retraining from scratch, effectively resolving the challenge of massive historical data accumulation while maximizing resource efficiency.
  • We validate the effectiveness of MAIF through extensive offline experiments on real-world datasets and rigorous online A/B testing in a large-scale industrial system. The results demonstrate that our framework achieves comprehensive coverage for both zero-shot and few-shot scenarios. Furthermore, it focuses on the seesaw phenomenon, ensuring that the adaptation to cold-start entities does not compromise the performance of warm-up entities, effectively reducing the negative impact on business indicators during deployment.
The rest of this paper is organized as follows. Section 2 provides a review of the related work in the field of cold-start recommendations. Section 3 describes the network structure and the Auto-Selection mechanism that enables dynamic strategy switching during online inference. Section 4 presents the experimental results, discussing the performance of our model in various cold-start scenarios and comparing it with existing methods. Finally, Section 5 concludes the paper and acknowledges the contributions of our collaborators and sponsors.

2. Related Work

Embedding-based models have achieved state-of-the-art performance in the industry [5,18,19]. All raw features are mapped to dense representations in a learned embedding space. In addition, a well-optimized ID embedding can significantly enhance the performance of recommendation systems [14,20]. However, it also unavoidably introduces several challenges, such as overfitting and degraded performance in cold-start situations for both users and items. According to the number of interactions, we divide the cold-start problem into two scenarios: few-shot (only a few samples) and zero-shot (no samples).
In a few-shot scenario, an intuitive way to alleviate the problem is to utilize more of the available data. It is common to use side information [10,11,21] (e.g., user attributes, item attributes, relational data) or a knowledge graph to obtain potential preferences [22,23]. Recently, meta-learning [24] has been introduced as a solution to the few-shot cold-start problem, and model-agnostic meta-learning (MAML) is a powerful meta-learning method for recommendation [2,25,26]. It learns how to learn new tasks from a small number of examples by using prior experience, achieving fast adaptation to new tasks. For example, MeLU [2] regards each user as a task and trains each task independently, and most MAML-based follow-up methods, such as MetaHIN [27] and MAMO [28], build on it. These methods divide the model parameters into two groups for optimization: during the local update phase, they minimize the training loss on the support set, while in the global update phase, they minimize the test loss on the query set. MWUF [14] uses two meta networks to learn the embeddings of cold ID items. MEG [26] focuses on how to initialize and learn the ID embedding for new ads, and it is built on a pre-trained model. However, MAML-based methods ignore the fact that a vast number of cold-start IDs (potentially millions) enter the system each day, which makes it unacceptable for each task to have independent parameters and training processes.
In a zero-shot scenario, one approach to solving the cold-start problem is to design a decision-making strategy, such as multi-armed bandit algorithms [29,30] and reinforcement learning [31]. The idea is to benefit from the explore–exploit tradeoff. In embedding-based models, the problem can be conceptually framed as a latent feature imputation challenge where the unseen feature’s embedding is the missing element. From this perspective, feature imputation methods, such as MICE [32], which use iterative regression to model conditional missing variables, offer an alternative approach by statistically inferring the missing representation from observed attributes. Moreover, generative models [15,33] efficiently address the fundamental problem of insufficient data. For example, ref. [15] aims to estimate future user engagement, like the number of reviews, clicks, etc. Assuming we have an oracle that can predict all feature values accurately, the cold-start problem will be effectively addressed. MAIL [33] designs a zero-shot tower that uses dual autoencoders to generate virtual behavior data from highly aligned hidden features for new users. It enriches user profiles and improves recommendation accuracy, even in the absence of extensive user–item interaction data.
Furthermore, some methods have been applied in hybrid scenarios. By transferring and sharing information across different domains, cross-domain techniques have been proposed to alleviate data sparsity [3,16,34,35]. For example, DRL [34] leverages knowledge transfer to address the cold-start problem for non-overlapping users in cross-domain recommendation scenarios. By integrating effective knowledge from source domains and optimizing the structure of subnetworks in the target domain, MSTL [34] achieves better generalization and accuracy across benchmarks and industrial cases. Another line of work tackles the cold-start problem with the dropout technique. For example, Dropout-Net [7] randomly selects a subset of users and items during training and drops their interaction records to simulate the cold-start scenario. However, because it assumes that data are missing at random, Dropout-Net may introduce biased predictions [10] and the seesaw phenomenon.

3. Model-Agnostic Industrial Framework

3.1. Embedding Layer

In industry, the embedding technique plays a crucial role in transforming high-dimensional sparse features into lower-dimensional dense vectors. This process typically begins with feature hashing, where a hash function maps distinct features to indices within a fixed-size space, employing mechanisms to alleviate hash collisions. Subsequently, an embedding lookup operation retrieves corresponding vectors from an embedding table, using the hashed indices.
$$x[i] = \mathrm{embedding\_lookup}(\Psi, i) \in \mathbb{R}^{m} \tag{1}$$
$\Psi$ denotes the embedding table, and $i$ represents the index of a feature generated by the mapping function $F(\cdot)$. In other words, the feature is encoded into an $m$-dimensional dense vector $x[i]$ using the $\mathrm{embedding\_lookup}$ operation. $\Psi$ is learned through model training, during which the semantic relationships among features are optimized using gradient-based optimization methods to minimize loss. The resulting embeddings are utilized as input to downstream tasks, providing a compact and semantically rich representation that improves computational efficiency and model performance.
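The hashing and lookup steps of Equation (1) can be illustrated with a minimal sketch. The CRC32 hash, the table size, the field and value names, and the zero initial state are all assumptions made for illustration; the production mapping function $F(\cdot)$ and its collision handling are not specified here.

```python
import zlib

EMBED_DIM = 4       # m, the embedding dimension (illustrative value)
TABLE_SIZE = 1000   # number of rows in the embedding table (illustrative value)

# Embedding table Psi: every row starts at zero, mimicking an untrained entry.
psi = [[0.0] * EMBED_DIM for _ in range(TABLE_SIZE)]


def feature_index(field: str, value: str) -> int:
    """Hypothetical mapping function F(.): hash a (field, value) pair to a row index."""
    key = f"{field}={value}".encode("utf-8")
    return zlib.crc32(key) % TABLE_SIZE


def embedding_lookup(table, index):
    """Return the m-dimensional dense vector x[i] stored at the given index."""
    return table[index]


i = feature_index("user_id", "u_42")
psi[i] = [0.1, -0.3, 0.2, 0.05]  # pretend that training has updated this row
x_seen = embedding_lookup(psi, i)

# A never-observed value maps to a row that training has not touched
# (unless its hash happens to collide with a trained row).
x_unseen = embedding_lookup(psi, feature_index("user_id", "u_new"))
```

In this toy setting, the row returned for the unseen value carries no learned information, which is the situation the rest of the framework is designed to handle.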

3.2. Feature Classification

In recommendation systems, the features of incoming samples are sparsely observed in historical interactions. Consequently, if a feature field contains many unseen features, the dense vector derived from Equation (1) remains in its initial state, typically a zero or randomly initialized vector. We refer to this feature field as a missing feature, which is ubiquitous during both training and inference. Formally, we consider a standard data structure in which all samples have the same set of feature fields. A feature field refers to a specific attribute that represents a particular domain of feature values. Let $\mathcal{F} = \{F_1, F_2, \ldots, F_n\}$ denote the universal set of such feature fields constituting the input space, which includes common types such as single-value, multi-value, and cross features. We partition $\mathcal{F}$ into three disjoint subsets:
  • Missing Feature ($\mathcal{F}_{miss}$): A subset $\mathcal{F}_{miss} \subset \mathcal{F}$ representing the target fields (e.g., UserID or ItemID) characterized by high sparsity, where the latent information is frequently inaccessible because the fields contain many unseen features.
  • Transfer Feature ($\mathcal{F}_{trans}$): A subset $\mathcal{F}_{trans} \subset \mathcal{F}$ designated to approximate the semantics of the missing features. These fields serve as the information source for reconstruction.
  • Common Feature ($\mathcal{F}_{common}$): The set of remaining feature fields, defined as $\mathcal{F}_{common} = \mathcal{F} \setminus (\mathcal{F}_{miss} \cup \mathcal{F}_{trans})$.
Notably, the transfer feature can represent a missing feature with some degree of accuracy, as it inherently reflects real-world characteristics. For example, latitude and longitude or regional information may describe an IP address, because devices located in the same geographical area may belong to the same IP sub-network. When encountering a new device, the pre-trained generator initializes the embedding of IP by utilizing its attributes and context. Our study uses the meta-learning concept to learn and represent the embedding of missing features via transfer features. However, it is important to emphasize that MAIF is not a typical meta-learning training framework.
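As a small illustration of the partition, the three subsets can be written as plain sets. The field names below are hypothetical and chosen only to make the example concrete; in practice the assignment of fields to subsets follows business experience or feature profiling.

```python
# Hypothetical feature fields of a CTR model (names are illustrative only).
all_fields = {
    "user_id", "item_id", "age", "gender", "city",
    "category", "brand", "hour", "device_type",
}

# Target fields with many unseen values.
missing_fields = {"user_id", "item_id"}

# Attribute fields used to approximate the semantics of the missing fields.
transfer_fields = {"age", "gender", "city", "category", "brand"}

# Everything else: F_common = F \ (F_miss U F_trans).
common_fields = all_fields - (missing_fields | transfer_fields)

# The three subsets are pairwise disjoint and together cover all fields.
is_partition = (
    missing_fields.isdisjoint(transfer_fields)
    and (missing_fields | transfer_fields | common_fields) == all_fields
)
```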

3.3. Preliminaries

CTR prediction is usually designed as a supervised binary classification task. The basic information contained in each sample includes a user, an item, and context. Both the user and the item have many features to describe them. As mentioned above, the raw features are transformed into lower-dimensional dense vectors based on the embedding technique. Hence, after passing through the embedding layer, the feature field set $\mathcal{F}$ is transformed into a corresponding set of dense feature embeddings $X = \{x_1, \ldots, x_n\}$, where each $x_i$ represents the embedding representation of the field $F_i$. The model predicts the probability $\hat{p}$ by applying the function $f(\cdot)$ to the input feature embeddings $X$, where the output $\hat{p}$ represents the predicted probability of the user clicking the item. The function $f(\cdot)$ depends on the model parameters $\theta$, which are learned during the training process.
$$\hat{p} = f(X; \theta) \tag{2}$$
where $X$ is composed of three parts: $X_{common}$, $X_{trans}$, and $X_{miss}$, which denote the embeddings of the common feature, transfer feature, and missing feature, respectively.
When encountering a never-observed feature that belongs to the specific feature field $x_k$ and has a hashed index $i$, $x[i]$ cannot be retrieved from the embedding table $\Psi$. If $x_k \in X_{miss}$, the output embedding of $X_{miss}$ will be reorganized as follows:
$$\psi[X_{miss}] = g(X_{trans}, \phi) \tag{3}$$
where $\phi$ is the set of parameters of the function $g(\cdot)$, which uses $X_{trans}$ to represent $X_{miss}$. Then, Equation (2) is reformulated as follows for the cold-start scenarios:
$$\begin{aligned} \hat{p}_{cold} &= f(X_{common}, X_{trans}^{1}, X_{miss}^{1}, X_{trans}^{2}, X_{miss}^{2}, \ldots, X_{trans}^{n}, X_{miss}^{n}; \theta) \\ &\approx f(X_{common}, X_{trans}^{1}, \psi[X_{miss}^{1}], X_{trans}^{2}, \psi[X_{miss}^{2}], \ldots, X_{trans}^{n}, \psi[X_{miss}^{n}]; \theta, \phi) \end{aligned} \tag{4}$$
It is worth noting that a model may contain several cold-start topics, where $X_{trans}^{n}$ and $X_{miss}^{n}$ appear in pairs to optimize one designed topic.
The cold-start topic is typically designed based on either business experience or the feature distribution derived from profiling results. It refers to one or more missing features represented by a shared set of transfer features, with the missing features and transfer features denoted as $X_{miss} \in \mathbb{R}^{M \times K}$ and $X_{trans} \in \mathbb{R}^{T \times K}$, respectively. Here, $M$ and $K$ denote the number of missing features and the embedding dimension of each feature, respectively. Note that different missing feature topics may have different values of $M$. Similarly, $T$ is the number of transfer features. Even within the same topic, $T$ and $M$ may differ.
As an example, a model designed with two cold-start topics is presented in Figure 1. We use multi-task learning techniques to modify the model to be upgraded (a) by adding a new task as (b). As shown in the sub-graph (b), an example of user cold-start and item cold-start topics is illustrated, with the user missing feature as UserID and item missing feature as ItemID, respectively.
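The mapping of Equation (3) for a single cold-start topic can be sketched as follows. This is a toy forward pass under stated assumptions: $g$ is taken to be two fully connected layers with a ReLU in between, and the helper names (`dense`, `init_phi`, `trans_block`), the hidden width, and the random initialization are hypothetical rather than the production design.

```python
import random


def relu(v):
    return max(0.0, v)


def dense(x, weights, bias, activation=None):
    """One fully connected layer; weights holds one row per output unit."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [activation(v) for v in out] if activation else out


def init_phi(rng, t, m, k, hidden):
    """Randomly initialise phi for T transfer and M missing features of dimension K."""
    def layer(n_in, n_out):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        return weights, [0.0] * n_out
    return {"layer1": layer(t * k, hidden), "layer2": layer(hidden, m * k), "k": k}


def trans_block(x_trans, phi):
    """g(X_trans, phi): map T x K transfer embeddings to M x K generated embeddings."""
    flat = [v for emb in x_trans for v in emb]
    w1, b1 = phi["layer1"]
    w2, b2 = phi["layer2"]
    hidden = dense(flat, w1, b1, activation=relu)
    out = dense(hidden, w2, b2)
    k = phi["k"]
    return [out[j:j + k] for j in range(0, len(out), k)]


rng = random.Random(0)
T, M, K = 3, 1, 4
phi = init_phi(rng, T, M, K, hidden=8)
x_trans = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(T)]
x_common = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(2)]

# psi[X_miss] replaces the stored X_miss in the concat layer of the cold-start task.
psi_x_miss = trans_block(x_trans, phi)
concat_input = [v for emb in x_common + x_trans + psi_x_miss for v in emb]
```

Because the generated embedding keeps the shape $M \times K$, the input to the upper layers has the same length as in the base task, which is what allows the dense layers to be shared.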

3.4. Loss Function

We define the loss function of the off-the-shelf recommendation model as $\mathcal{L}$. For example, the cross-entropy loss is widely used in CTR prediction models. The model is optimized by minimizing the loss between the predicted probability $\hat{p}$ and the corresponding binary label $y \in \{0, 1\}$, which represents whether the user clicked the item.
$$\mathcal{L}(\theta) = -y \log \hat{p} - (1 - y) \log (1 - \hat{p}), \qquad \theta^{*} = \operatorname*{argmin}_{\theta} \mathcal{L}(\theta) \tag{5}$$
Similarly, the cold-start loss $\mathcal{L}_{cold}$ is calculated using the prediction $\hat{p}_{cold}$. Crucially, while the loss depends on both parameter sets $(\theta, \phi)$, it is optimized only over $\phi$.
$$\mathcal{L}_{cold}(\theta, \phi) = -y \log \hat{p}_{cold} - (1 - y) \log (1 - \hat{p}_{cold}), \qquad \phi^{*} = \operatorname*{argmin}_{\phi} \mathcal{L}_{cold}(\theta, \phi) \tag{6}$$
It should be noted that the dataset trained by $\mathcal{L}_{cold}$ is limited to cold-start samples rather than the full dataset, which will be discussed further in Section 4.
Furthermore, $\mathcal{L}_{auxiliary}$ is proposed not only to enhance the ability of Equation (3) to fit $X_{miss}$ but also to enable the reuse of network parameters. For instance, cosine similarity loss can be simply used as $\mathcal{L}_{auxiliary}$, which can be calculated as follows:
$$\mathcal{L}_{auxiliary}(\theta, \phi) = -\frac{X_{miss}}{\lVert X_{miss} \rVert_{2}} \cdot \frac{\psi[X_{miss}]}{\lVert \psi[X_{miss}] \rVert_{2}}, \qquad \phi^{*} = \operatorname*{argmin}_{\phi} \mathcal{L}_{auxiliary}(\theta, \phi) \tag{7}$$
$\mathcal{L}_{auxiliary}$ ranges from −1 to 1, with 0 representing orthogonality, values approaching −1 reflecting higher similarity, and values near 1 indicating higher dissimilarity. Similar to Equation (6), it is optimized only with respect to $\phi$. The operation $\lVert v \rVert_{2}$ represents the L2-norm, defined as the square root of the sum of the squared elements of the vector. Here, $\epsilon$ is a small positive constant added to prevent division by zero.
$$\lVert v \rVert_{2} = \max\left(\sqrt{\sum_{i} v_{i}^{2}},\ \epsilon\right) \tag{8}$$
Both the $\mathcal{L}_{cold}$ and $\mathcal{L}_{auxiliary}$ losses facilitate the alignment between transfer and missing features. This joint optimization learns the parametric mapping defined in Equation (3), effectively mapping transfer features into the semantic space of missing features.
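The two losses and the clamped norm can be checked numerically with a short sketch. It treats the embeddings as flat vectors and uses an assumed value of $\epsilon = 10^{-8}$; the function names are hypothetical.

```python
import math

EPS = 1e-8  # assumed value for the small constant epsilon


def l2_norm(v, eps=EPS):
    """Clamped L2-norm of Equation (8): never smaller than eps."""
    return max(math.sqrt(sum(x * x for x in v)), eps)


def binary_cross_entropy(y, p_hat):
    """Cross-entropy of Equations (5) and (6); p_hat must lie strictly in (0, 1)."""
    return -y * math.log(p_hat) - (1 - y) * math.log(1 - p_hat)


def auxiliary_loss(x_miss, psi_x_miss):
    """Negative cosine similarity of Equation (7) between stored and generated embeddings."""
    dot = sum(a * b for a, b in zip(x_miss, psi_x_miss))
    return -dot / (l2_norm(x_miss) * l2_norm(psi_x_miss))


aligned = auxiliary_loss([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = auxiliary_loss([1.0, 0.0], [0.0, 3.0])   # orthogonal vectors
opposite = auxiliary_loss([1.0, 0.0], [-1.0, 0.0])    # opposite direction
zero_safe = auxiliary_loss([0.0, 0.0], [1.0, 0.0])    # zero vector, no division error
```

The three directional cases reproduce the range described above: the loss reaches −1 when the generated embedding points in the same direction as the stored one and 1 when it points in the opposite direction.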

3.5. Network Structure

The architecture of our framework is summarized in Figure 1. We apply multi-task learning techniques to divide the model into two tasks, as shown in (a) and (b). This approach allows the full sample set and the cold-start sample set to be trained within their independent tasks, respectively.
Firstly, an embedding layer converts each raw input field into a fixed-length dense vector. Secondly, different processes are applied to the three types of features, as described in Equations (1) and (3). The two concat layers in Figure 1 differ only in the embedding semantic space of the missing features, while retaining the same embedding dimensions. The Trans Block, as a sub-model with parameters $\phi$, can be designed with various structures. However, this is not the focus of our study, so we select two fully connected layers as the basic structure.
The upper dense layers of the base and cold-start tasks share the parameters of all hidden layers. In the cold-start task, the dense and embedding layers are frozen during training, and only the parameters of the Trans Block are updated based on the transfer features. Taking into account the distribution differences of cold-start samples, and based on our practical experience, the cold-start task achieves better performance when trained only with cold-start samples. The generation of these samples is detailed in Section 3.6. Dashed arrows indicate forward computation only, without backward propagation updates. Theoretically, the structure and parameters of the warm-up task are identical to the baseline model, assuming no influence from random factors. Algorithm 1 describes the offline training phase.
The benefit of hidden layer sharing is that, when research personnel implement strategy upgrades, the model does not need to be trained from scratch. This approach reduces the training cost, requires less data, and results in faster model convergence. Moreover, it provides higher experimental efficiency and more stable experimental outcomes. This is because many online models accumulate data day after day in the industry. Retraining a model with limited data may struggle to match the performance of an online model. Therefore, some tricks of the trade are used in practical work, such as initializing only the newly added parameters while loading the remaining parameters from a baseline model.
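The freezing of shared parameters in the cold-start task can be illustrated with a deliberately tiny scalar model. Everything here is an assumption made for illustration: the upper layers are reduced to a single frozen weight `theta`, the Trans Block to a single weight `phi`, and the gradient is derived by hand from the cross-entropy loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(y, p):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


def cold_start_step(theta, phi, x_trans, y, lr):
    """One cold-start update: theta is used in the forward pass but never changed."""
    psi = phi * x_trans                          # toy Trans Block output
    p_cold = sigmoid(theta * psi)                # forward pass through the frozen layer
    grad_phi = (p_cold - y) * theta * x_trans    # dL_cold/dphi via the chain rule
    return phi - lr * grad_phi, p_cold


theta = 2.0   # shared parameter, loaded from the base task and kept frozen
phi = 0.0     # Trans Block parameter, the only value that is updated
losses = []
for _ in range(20):
    phi, p_cold = cold_start_step(theta, phi, x_trans=1.0, y=1, lr=0.1)
    losses.append(bce(1, p_cold))
```

In this sketch the loss on the cold-start sample decreases while `theta` keeps its original value, so the prediction of the base task for warm entities would be unaffected.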

3.6. Auto-Selection

In this subsection, we discuss how to generate a proper prediction value for various scenarios with real-time online traffic. For example, when a new user enters the recommendation system, the pre-defined cold-start topic for users will be triggered. If the user ID is designed as the missing feature, its embedding will be generated by the Trans Block and become a part of the concat layer in the cold-start task. It then participates in forward propagation, ultimately producing the CTR prediction $\hat{p}_{cold}$.
The difference between online inference and offline training lies in the activation of Trans Blocks for each topic. During online inference, whether these blocks are activated is determined by traffic conditions. When none of them are triggered, the values of $\hat{p}$ and $\hat{p}_{cold}$ are the same. Theoretically, only $\hat{p}_{cold}$ will ultimately be used in the system, while $\hat{p}$ will be logged for further analysis. The process of selecting proper Trans Blocks for activation, i.e., determining which cold-start topics to involve to produce $\hat{p}_{cold}$, is referred to as Auto-Selection. Importantly, the threshold $\tau$ for online Auto-Selection can be decoupled from the offline training phase, offering the flexibility to fine-tune the intervention scope based on specific business needs.
Algorithm 1 Offline Training for Base and Cold-Start Tasks
Require:
    $\alpha, \beta, \gamma$: learning rates for gradient updates;
    $\theta, \phi$: parameters of base task and Trans Block in cold-start task;
    $\Psi$: embedding table;
    $\mathcal{D}$: dataset sorted by timestamp;
    $\mathrm{Opt}(\cdot, \cdot, \cdot)$: the gradient-based optimization function;
 1: Randomly initialize $\theta, \phi$;
 2: for each mini-batch $\mathcal{B}$ in $\mathcal{D}$ do
 3:    Set $\mathcal{B}_{cold}$ as empty
 4:    for each sample $\in \mathcal{B}$ do
 5:       Generate embedding layer of base task with Equation (1);
 6:       Generate embedding layer of cold-start task with Equation (3), triggered by Equation (9);
 7:       if any Trans Block is triggered then
 8:          Add sample to $\mathcal{B}_{cold}$
 9:       end if
10:    end for
       /* Base task processing */
11:    Evaluate base task loss $\mathcal{L}$ with corresponding $\hat{p}$;
12:    Compute gradients $\nabla_{\theta} \mathcal{L}$ and $\nabla_{\Psi} \mathcal{L}$;
13:    Update $\theta, \Psi$ using optimizer $\mathrm{Opt}(\theta, \nabla_{\theta} \mathcal{L}, \alpha)$ and $\mathrm{Opt}(\Psi, \nabla_{\Psi} \mathcal{L}, \beta)$;
       /* Cold-start task processing */
14:    Evaluate cold-start task loss $\mathcal{L}_{cold}$ and $\mathcal{L}_{auxiliary}$ with corresponding $\hat{p}_{cold}$ for $\mathcal{B}_{cold}$;
15:    Compute gradients $\nabla_{\phi} \mathcal{L}_{cold}$ and $\nabla_{\phi} \mathcal{L}_{auxiliary}$;
16:    Update $\phi$ using optimizer $\mathrm{Opt}(\phi, \nabla_{\phi} \mathcal{L}_{cold}, \gamma)$ and $\mathrm{Opt}(\phi, \nabla_{\phi} \mathcal{L}_{auxiliary}, \gamma)$.
17: end for
Based on our experience and experimental results, we propose two strategies to determine the conditions for triggering a specific cold-start topic. The first method uses the L2-norm of the missing feature embedding retrieved from the embedding table $\Psi$, as given in Equation (8). The second method records the impression or click count of each feature in $X_{miss}$ during training and then saves it in a format like TensorFlow SavedModel. It is implemented in the training framework as feature counters and requires additional storage overhead.
In both methods, the value is compared against a predefined threshold that triggers the cold-start topic. When the value is below the threshold during online serving, it is assumed that the model has not fully learned the feature, even though the model may have encountered the feature a limited number of times. Note that one or more cold-start topics may be activated within a request. For example, Equation (9) shows whether a cold-start topic is activated using the first method, where $O_T^{(i)}$ is the output of the $i$-th Trans Block and $\tau$ is the threshold on the L2-norm of the embedding. A detailed explanation of the online serving phase is provided in Algorithm 2.
$$O_T^{(i)} = \begin{cases} X_{miss}^{(i)}, & \lVert X_{miss}^{(i)} \rVert_{2} \geq \tau \\ \psi[X_{miss}^{(i)}], & \lVert X_{miss}^{(i)} \rVert_{2} < \tau \end{cases} \tag{9}$$
Algorithm 2 Online Serving with Auto-Selection
Require:
    $pv$: online request with metadata and attributes;
 1: Feature extraction from $pv$;
    /* Base task processing */
 2: Generate embedding layer of base task with Equation (1);
 3: Obtain $\hat{p}$ via forward propagation for the base task;
    /* Cold-start task processing */
 4: for each Trans Block do
 5:    if Trans Block is triggered then
 6:       Reorganize embedding layer using $\psi[X_{miss}]$;
 7:    end if
 8: end for
 9: Obtain $\hat{p}_{cold}$ via forward propagation using above embedding layer for cold-start task;
    /* If none of Trans Blocks are triggered, $\hat{p}$ and $\hat{p}_{cold}$ hold the same value */
10: Return $\hat{p}_{cold}$ as the final prediction, with $\hat{p}$ as optional.
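The norm-based trigger of Equation (9) can be sketched per topic as follows. The threshold value, the topic names, and the embedding values are illustrative assumptions; in a real serving path the Trans Block output would presumably be computed only for triggered topics rather than supplied up front.

```python
import math


def l2_norm(v, eps=1e-8):
    """Clamped L2-norm, as in Equation (8)."""
    return max(math.sqrt(sum(x * x for x in v)), eps)


def auto_select(x_miss, generated, tau):
    """Equation (9): return (embedding to use, whether the Trans Block was triggered)."""
    if l2_norm(x_miss) >= tau:
        return x_miss, False   # embedding is considered learned; keep it
    return generated, True     # embedding is considered cold; use psi[X_miss]


tau = 0.05  # illustrative threshold, tunable independently of offline training
topics = {
    "user": {"x_miss": [0.0, 0.0, 0.0], "generated": [0.2, -0.1, 0.3]},
    "item": {"x_miss": [0.4, 0.1, -0.2], "generated": [0.1, 0.1, 0.1]},
}

selected = {}
triggered = []
for name, emb in topics.items():
    selected[name], fired = auto_select(emb["x_miss"], emb["generated"], tau)
    if fired:
        triggered.append(name)
```

Here only the user topic is triggered, because the stored user embedding is still a zero vector while the item embedding has a norm well above the threshold. If no topic were triggered, the selected embeddings would equal the stored ones and the two predictions would coincide.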

4. Experiments and Discussion

We conduct experiments on open-source datasets for offline analysis. Additionally, an online A/B test is carried out in a well-known advertising system. Both approaches validate the effectiveness of our proposed MAIF in addressing the cold-start problem. In real-world industrial scenarios in particular, the rigorous online tests demonstrate its performance and the efficiency it brings to upgrading existing models.

4.1. Offline Experiment

4.1.1. Datasets

To comprehensively evaluate our method, particularly its performance in real-world cold-start scenarios, we conduct experiments on two datasets. Primarily, we utilize the KuaiRec dataset [1], a real-world industrial dataset derived from Kuaishou, a famous social video-sharing App with billions of users. Owing to its rich feature space, unique design, and data collection mechanism, this dataset is particularly well suited for evaluating cold-start problems. The dataset contains interactions from 7176 users and 10,728 items, resulting in a total of 12,530,806 interactions. In addition, to verify the generalizability of our model on a standard benchmark, we also employ MovieLens-1M [36], a widely used dataset in recommender systems. It comprises approximately 1 million explicit ratings generated by over 6000 users across nearly 4000 items.
KuaiRec provides both fully observed and sparsely observed interaction data from an online environment. The fully observed data serves as an ideal warm-up scenario, with each of the 1411 users having been exposed to all 3327 items, while the sparsely observed data offers three cold-start scenarios, as illustrated in Figure 2. Relative to the samples in the red dashed box, which represent the warm-up scenario in Figure 2, the three black dashed boxes mark the samples for the three cold-start scenarios, which can be summarized as follows:
  • User cold-start scenario (upper right): existing items recommended to new users.
  • Item cold-start scenario (lower left): new items recommended to existing users.
  • User–Item cold-start scenario (upper left): new items recommended to new users.
The first two scenarios are considered few-shot scenarios, while the third is a zero-shot scenario.
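The four-way partition above can be expressed as a small lookup. This is only a sketch: it assumes the warm-up block is summarized by two ID sets (`warm_users`, `warm_items`), and both those names and the function are hypothetical.

```python
def scenario(user_id, item_id, warm_users, warm_items):
    """Assign a sample to one of the four scenarios.

    warm_users / warm_items are assumed to be the sets of IDs that occur
    in the warm-up (fully observed) block.
    """
    new_user = user_id not in warm_users
    new_item = item_id not in warm_items
    if new_user and new_item:
        return "user-item cold-start"   # zero-shot
    if new_user:
        return "user cold-start"        # few-shot: the item is known
    if new_item:
        return "item cold-start"        # few-shot: the user is known
    return "warm-up"

# Toy IDs, for illustration only.
warm_users, warm_items = {1, 2, 3}, {10, 11}
labels = [scenario(u, i, warm_users, warm_items)
          for u, i in [(1, 10), (9, 10), (1, 99), (9, 99)]]
```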

4.1.2. Baseline Models

We evaluate MAIF by comparing it with three categories of methods: (1) traditional methods, (2) dropout-based methods, and (3) meta-learning methods.
MLP [37] is a type of artificial neural network that consists of multiple layers of interconnected nodes. Each node in one layer is connected to every node in the adjacent layer. It is considered one of the most fundamental models in the industry.
DeepFM [38] is a hybrid model combining Factorization Machines (FMs) for feature interaction and deep neural networks (DNN) for capturing high-order interactions. The FM component models pairwise interactions, while the DNN handles complex high-order interactions. This combination leverages both shallow and deep interactions, making DeepFM well-suited for recommendation systems.
Wide&Deep [39] is a hybrid model that integrates a wide linear model and a deep neural network to leverage the strengths of both approaches. This combination allows it to effectively utilize both shallow and deep feature interactions, making it well-suited for recommendation systems.
MICE [32] is a classic feature imputation method that handles missing data by iteratively modeling the conditional distribution of each missing variable, progressively improving the accuracy of imputations with each iteration.
Dropout-Net [7] incorporates concepts from denoising autoencoders by training the model with the dropout technique, promoting better generalization and robustness in recommendation systems.
MeLU [2], a Model-Agnostic Meta-Learning (MAML)-based algorithm, can estimate a new user’s preferences from only a few consumed items. It provides good personalized recommendations based on each user’s unique item-consumption history.
Meta-Embedding Generator [26] (MEG) is also a MAML-based algorithm that focuses on how to initialize and learn the ID embedding for new ads. It is built on a pre-trained model that includes both the cold-start and warm-up phases.

4.1.3. Evaluation Metrics

The AUC metric [40] is widely used in recommendation and advertising systems. It measures ranking quality by ordering all items according to their predicted scores. In addition, we also adopt the relative improvement [41] (RelaImp) metric for a better comparison. It removes the constant part (0.5) from the AUC value, which represents random guessing, and then computes the relative improvement. Hence, RelaImp is defined as follows:
$$\text{RelaImp} = \left( \frac{\text{AUC}(\text{experimental model}) - 0.5}{\text{AUC}(\text{base model}) - 0.5} - 1 \right) \times 100\%.$$
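As a quick numerical check of this definition, the sketch below evaluates it on made-up AUC values. The numbers are illustrative only and are not taken from the paper's tables, and the function name `rela_imp` is ours.

```python
def rela_imp(auc_experimental, auc_base):
    """Relative AUC improvement in percent, after removing the 0.5 chance level."""
    if auc_base == 0.5:
        raise ValueError("base AUC of 0.5 leaves no margin over random guessing")
    return ((auc_experimental - 0.5) / (auc_base - 0.5) - 1.0) * 100.0

# Illustrative values: 0.70 -> 0.72 is a 2-point AUC gain,
# but a 10% gain relative to the 0.20 margin over random guessing.
gain = rela_imp(0.72, 0.70)
```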
To the best of our knowledge, many previous works on cold-start focus only on the ranking effect, which is unrelated to the absolute value of the model’s predicted score, and therefore overlook another important metric: calibration. In Optimized Cost Per X (oCPX) [42] billing mode in advertising systems, the prediction not only determines the position of the ads but also impacts the cost. Similarly, in recommendation systems, due to the presence of multi-objective ranking, a single estimated value can impact other metrics, leading to a zero-sum effect. Therefore, we use the predicted CTR over the true CTR (PCOC) [43] to measure the calibration performance of each method. The closer the PCOC is to 1.0, the better calibrated the prediction is. A value above 1 indicates overestimation, and a value below 1 indicates underestimation. PCOC is calculated as follows:
$$\text{PCOC} = \frac{\sum_{i=1}^{D} \hat{p}_i}{\sum_{i=1}^{D} y_i}$$
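A minimal sketch of the PCOC computation is given below. The helper name `pcoc` and the toy predictions and labels are made up for illustration.

```python
import numpy as np

def pcoc(p_hat, y):
    """Predicted CTR over observed CTR: sum of predictions / sum of click labels."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    clicks = y.sum()
    if clicks == 0:
        raise ValueError("PCOC is undefined when there are no positive labels")
    return float(p_hat.sum() / clicks)

# Toy labels: 2 clicks out of 4 impressions.
y = [0, 0, 1, 1]
calibrated = pcoc([0.2, 0.4, 0.6, 0.8], y)   # predictions sum to the click count
overestimated = pcoc([0.5, 0.7, 0.9, 0.9], y)
```

Here the first set of predictions sums to the number of clicks, so its PCOC is 1, while the second set overestimates and yields a value above 1.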

4.1.4. Offline Experimental Settings

Dataset Settings. To present the experimental results more clearly and simulate real-world deployment conditions, the KuaiRec dataset is split along two dimensions. First, the dataset is divided into four scenarios, as detailed in Section 4.1.1. Note that we train only on the warm-up data, represented by the red dashed box in Figure 2, which ensures that new entities are naturally isolated. For example, samples in the user cold-start scenario, which involve new user entities, are strictly absent from the warm-up scenario. Second, the warm-up data is split chronologically into a training set and a test set with a 9:1 ratio. Similarly, we sample equally sized test sets from the other three cold-start scenarios, which together form the complete test set for all four scenarios. For the MovieLens-1M dataset, we also strictly maintain the chronological order. To ensure sufficient data for partitioning the test set into four scenarios, we set the split ratio between the training and test sets at 8:2.
We designed two cold-start topics based on the characteristics of the KuaiRec and MovieLens-1M datasets, respectively. The feature design for each cold-start topic is detailed in Table 1. Following common practices in recommendation system evaluation [1,14], both datasets are adapted for CTR prediction through label binarization. Specifically, positive labels ($y = 1$) are assigned when $\text{watch\_ratio} \ge 2$ in KuaiRec and $\text{ratings} \ge 4$ in MovieLens-1M, with the remaining instances labeled as $y = 0$.
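The data preparation described in this subsection can be illustrated roughly as follows. The field names (`timestamp`, `watch_ratio`), the helper names, the toy rows, and the choice of a greater-than-or-equal comparison for the label threshold are assumptions made for this sketch, not details taken from the authors' code.

```python
def binarize(value, threshold):
    """Return 1 for a positive sample, 0 otherwise (assumes value >= threshold is positive)."""
    return 1 if value >= threshold else 0

def chrono_split(samples, train_ratio, time_key="timestamp"):
    """Sort by time and cut once, so no test sample precedes a training sample."""
    ordered = sorted(samples, key=lambda s: s[time_key])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]

# Toy KuaiRec-like rows; the values are made up.
rows = [{"timestamp": t, "watch_ratio": w}
        for t, w in [(5, 0.4), (1, 2.5), (9, 2.0), (3, 1.9), (7, 3.1),
                     (2, 0.1), (8, 0.8), (4, 2.2), (6, 1.0), (0, 0.3)]]
for r in rows:
    r["label"] = binarize(r["watch_ratio"], threshold=2.0)

# 9:1 chronological split of the warm-up data.
train, test = chrono_split(rows, train_ratio=0.9)
```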
Parameter Settings. To ensure a fair comparison, we keep all the model parameters as consistent as possible. We fix the embedding size (K) as 4 and the DNN with [256, 64, 32, 1] hidden units. For the Trans Block, we employ two fully connected layers as the basic structure with ReLU activation, configured with $[T \times K \times 2, M \times K]$ units. This structure maps the transfer features $X_{trans} \in \mathbb{R}^{T \times K}$ to the semantic space of missing features $X_{miss} \in \mathbb{R}^{M \times K}$. The learning rate is set to $2 \times 10^{-5}$, and the batch size is 128. All models are trained for a maximum of 10 epochs with early stopping. The dropout rate (the probability of dropping) is set to 0.2 for Dropout-Net. To adapt MICE for embedding-based models, we flattened the embedding vectors of all transfer features and the missing feature, treating each embedding dimension as an independent variable. We then employed Bayesian Ridge Regression within MICE to statistically impute the missing feature values.
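To make the layer sizes concrete, the following sketch builds the forward pass of a two-layer Trans Block with untrained random weights. It assumes that the T transfer embeddings are flattened before the first layer and that ReLU is applied only to the hidden layer; neither detail is stated in the text, and the class and argument names are hypothetical.

```python
import numpy as np

class TransBlock:
    """Forward pass only: (T*K) -> (T*K*expand) -> (M*K), ReLU on the hidden layer."""

    def __init__(self, T, M, K, expand=2, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_hid, d_out = T * K, T * K * expand, M * K
        self.W1 = rng.normal(0.0, 0.01, size=(d_in, d_hid))
        self.b1 = np.zeros(d_hid)
        self.W2 = rng.normal(0.0, 0.01, size=(d_hid, d_out))
        self.b2 = np.zeros(d_out)
        self.M, self.K = M, K

    def __call__(self, x_trans):
        # x_trans: array of shape (batch, T, K) holding transfer-feature embeddings
        flat = x_trans.reshape(x_trans.shape[0], -1)
        hidden = np.maximum(0.0, flat @ self.W1 + self.b1)
        out = hidden @ self.W2 + self.b2
        return out.reshape(-1, self.M, self.K)   # predicted missing-feature embeddings

# Offline setting from the text: K = 4, hidden width T*K*2; T and M are toy values.
block = TransBlock(T=3, M=1, K=4)
x_miss_hat = block(np.ones((5, 3, 4)))
```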
To account for the variations in data sparsity and feature distribution across different datasets, we computed the percentiles of positive interaction counts for item features in both the KuaiRec and MovieLens-1M datasets, as shown in Figure 3. The click counts corresponding to these percentiles serve as thresholds τ to further evaluate the experimental performance.
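One possible way to derive such thresholds is sketched below. It assumes a mapping from item ID to positive interaction count, and the percentile grid is an arbitrary choice for illustration rather than the one used in Figure 3.

```python
import numpy as np

def click_thresholds(item_clicks, percentiles=(10, 20, 30, 40, 50)):
    """Map each percentile to the corresponding click count over all items."""
    counts = np.asarray(list(item_clicks.values()), dtype=float)
    return {p: float(np.percentile(counts, p)) for p in percentiles}

# Toy counts: item i has i positive interactions.
toy_clicks = {f"item_{i}": i for i in range(101)}
thresholds = click_thresholds(toy_clicks)
tau = thresholds[20]   # e.g. the 20th-percentile click count as the trigger threshold
```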

4.1.5. Experiment Results

For the overall evaluation, we first fix τ by selecting the average click number corresponding to the 20th percentile of items, based on the specific data distribution of each dataset, as shown in Figure 3. Subsequently, we examine the performance of our proposed MAIF under varying quantile thresholds. The offline experimental results on the KuaiRec and MovieLens-1M datasets in the four scenarios are shown in Table 2 and Table 3. In the fully observed warm-up scenario, there is no marked difference in AUC between the models, and the PCOC values tend to be close to 1. MeLU, Dropout-Net, and MEG perform worse than MLP. Because cold-start samples lack gradient isolation, meta-learning methods exhibit a seesaw phenomenon in their performance metrics. Dropout-Net loses some necessary information due to the random feature dropout during training. Our proposed model accounts for the seesaw issue from the start of its design, and its various implementations have the same effect as their base models in this scenario.
In the few-shot scenarios, either the user or the item is a new entity. Dropout-Net achieves notable improvement in PCOC, benefiting from the forced adaptation of initialization vectors during training. Meanwhile, MICE, which employs an imputation-based approach to complete missing features, yields improvements in both AUC and PCOC. However, its gain in PCOC is not as significant as that of Dropout-Net. Meta-learning methods such as MeLU and MEG have shown certain advantages in these scenarios. For example, MEG uses an embedding generator to address the initialization of features. However, it is also important to point out that both MeLU and MEG treat each missing feature as a task with independent parameters in their training process. It is well known that recommendation systems contain a large number of unique identifiers for users and items (e.g., hundreds of millions). This approach results in overly sparse recommendation systems, potentially leading to underfitting. Overall, meta-learning methods still outperform traditional approaches in terms of AUC and PCOC, but our proposed model performs even better.
In the zero-shot scenario, both the user and the item are unseen entities. As expected, traditional methods suffer from poor performance in this scenario. In contrast, benefiting from designs applicable to zero-shot scenarios, Dropout-Net achieves improvements primarily in calibration performance, while MICE demonstrates gains mainly in prediction accuracy. However, MeLU and MEG, designed with a few-shot goal, have also shown their limitations. Because their authors did not mention how to handle zero-shot scenarios, we use average pooling over the parameters of all the tasks to deal with the unseen features. Under the design based on Equations (6) and (7), our model outperforms the baselines on various metrics in the zero-shot scenario.
The four scenarios cannot be directly compared due to the sample selection bias. For example, it makes no sense to directly compare the AUC performance between the item and user cold-start scenarios. Comparisons can only be made to evaluate the performance of different models within each scenario. In summary, traditional models fail to overcome the decline in ranking and calibration performance in cold-start scenarios. Although dropout techniques improve calibration performance, they achieve this through random feature dropping that damages the integrity of historical interactions. This trade-off evidently harms the ability to rank, as it compromises warm-up accuracy to support cold entities. While meta-learning methods demonstrate effectiveness in few-shot scenarios via an ‘ID-as-Task’ formulation, they often suffer from high computational complexity due to bi-level optimization. In contrast, MAIF adopts a multi-task learning paradigm augmented by an Auto-Selection mechanism. This design enables scalable, dynamic inference across the missing feature field, maintaining acceptable computational complexity while effectively addressing both zero-shot and few-shot scenarios. Benefiting from this unified framework, MAIF empirically outperforms all baselines across the three cold-start settings. Furthermore, its joint optimization mechanism inherently mitigates the seesaw phenomenon, making training and deployment significantly more cost-effective.
To verify key components, we added two ablation studies in Table 2 and Table 3 for three cold-start scenarios. Without reusing base task parameters, the model suffers significant accuracy degradation, as training on limited cold-start data introduces sample selection bias and leads to underfitting. Furthermore, excluding the auxiliary loss leads to lower accuracy and calibration performance, confirming its critical role in facilitating better feature alignment. In addition, Figure 4 illustrates the performance of MAIF (Wide&Deep) across three cold-start scenarios with varying thresholds τ for activating the Trans Block. To account for the variations in data sparsity and feature distribution, the values of τ are derived from the quantiles of item click counts specific to each dataset. It can be observed that incorporating more training samples into the cold-start task improves prediction accuracy. However, as the threshold τ increases, the ground-truth CTR of the selected samples rises significantly, as shown in Figure 3, leading to a gradual increase in prediction bias within the cold-start scenarios.

4.2. Online A/B Test

The online A/B test was conducted on the Dazhong Dianping (DZDP) app, the most popular online rating and deal service platform in China. All samples were collected from in-feed native ads on the homepage over a continuous period of time. As is standard in online A/B testing, samples were strictly organized in chronological order, so that we could effectively evaluate the various scenarios.

4.2.1. Online Experimental Settings

Offline Training. The model, which includes 2 billion embedding keys and 30 million parameters, was trained on 30 million samples per day, with each sample representing an exposed ad and containing over 300 feature fields. The label indicates whether the user clicked the ad. Convergence was reached after approximately 200 days of training data, with each iteration taking about 1 day to complete, and the model was trained on 100 CPU nodes. In the experiment, we added four cold-start topics to the existing model: user_id, item_id, plan_id, and creative_id, each representing a missing feature for the respective topic. plan_id is a unique identifier for an advertising campaign, and creative_id is a unique identifier for a specific ad creative. The model can be simply regarded as a four-layer MLP network, where user and item features are the two main types. The experimental model was warm-started from the baseline historical model and continued training on the most recent 30 days of data, requiring approximately 4 h of training. Because the majority of parameters are reused, this results in only a slight 6% increase in training time compared to the baseline. Compared with a full retraining on 200 days of data, the increase in training time caused by the additional architectural overhead is negligible. Thereafter, both the experimental and baseline models (control group) were updated daily. This approach reduces the training resources and improves the consistency of the models, which further supports the reliability of the experimental results. For each Trans Block of the cold-start topics, we again employ two fully connected layers as the basic structure with ReLU activation, configured with $[T \times K \times 4, M \times K]$ units, where the embedding size K is set to 8.
Compared to the massive capacity of the model to be upgraded, which hosts 2 billion embedding keys and 30 million dense parameters, the additional network parameters introduced by the Trans Blocks are negligible at less than 0.01%.
Online Deployment. Once the model is deployed and enters the online inference stage, Auto-Selection will begin working. We compared the two methods mentioned in Section 3.6, and here we use the one with the best performance and stability: recording the click count of each missing feature. When the count is less than or equal to 5, we use the corresponding activated Trans Block output. This threshold should be adjusted based on the specific business requirements. Therefore, the entire implementation process of Auto-Selection can be completed within the model itself, without the need for collaboration with the engineering team.
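The counter-based trigger can be sketched as a simple lookup. The dictionary of click counts and the function name are hypothetical, and treating a never-seen feature value as having zero clicks is an assumption of this sketch.

```python
def is_cold(feature_value, click_counts, max_clicks=5):
    """Trigger the Trans Block when the recorded click count is <= max_clicks.

    Values absent from the counter are treated as zero clicks (assumption),
    so completely new entities always trigger the block.
    """
    return click_counts.get(feature_value, 0) <= max_clicks

# Toy counters recorded during training (made-up values).
item_click_counts = {"item_a": 120, "item_b": 5, "item_c": 6}
flags = {name: is_cold(name, item_click_counts)
         for name in ["item_a", "item_b", "item_c", "item_new"]}
```

In practice the value of `max_clicks` would be tuned per business scenario, as noted above.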
Experimental Parameters. Following conventional industrial practice, the experimental and control groups were divided via a randomized user_id hashing bucket split. For experimental safety, traffic is gradually increased during the experiment. Once each group stabilizes at 10% of total users, we observe the two-week online A/B test results, which also serve as the final results for launch and experimental analysis.
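A hash-based bucket split of this kind might look as follows. The salt string, the use of MD5, the 100-bucket layout, and the particular bucket ranges are illustrative assumptions; the text does not describe the actual bucketing scheme.

```python
import hashlib

def bucket(user_id, salt="cold_start_ab", n_buckets=100):
    """Deterministically map a user to one of n_buckets via a salted hash."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def assign_group(user_id, exp_buckets=range(0, 10), ctl_buckets=range(10, 20)):
    """10 buckets each for experiment and control, i.e. about 10% of users per group."""
    b = bucket(user_id)
    if b in exp_buckets:
        return "experiment"
    if b in ctl_buckets:
        return "control"
    return "holdout"

groups = [assign_group(f"user_{i}") for i in range(10000)]
experiment_share = groups.count("experiment") / len(groups)
```

Because the assignment depends only on the hashed ID, a given user stays in the same group for the whole experiment, and a gradual traffic ramp-up can be modeled by widening the bucket ranges.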

4.2.2. A/B Test Results

Table 4 shows the A/B test results for cold-start scenarios. Compared to the control model, the PCOC metric improved from 0.9229 to 1.0415, and the AUC increased by 0.02 in cold-start scenarios. At the same time, the traffic to the online experiment resulted in a 20.3% increase in impressions and a 22.05% increase in clicks. More importantly, despite the significant increase in impressions, the CTR still maintained a 1.39% increase driven by the AUC improvement.
In addition to these business metrics, we also monitored core performance metrics for online systems, such as prediction latency and failure rate, as shown in Table 5. In the table, TP50, TP99, and TP999 represent the minimum times under which 50%, 99%, and 99.9% of the requests were served, respectively. The failure column lists the failure rate for model predictions. The results show that while the experimental model exhibits slightly higher latency in terms of TP50 and TP99 compared to the control model, the TP999 latency remained identical. The failure rate was equally low for both models. Collectively, these results demonstrate that the performance overhead introduced by MAIF to address the cold-start problem is limited.
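For reference, TP-style latency figures can be approximated from raw request latencies with percentiles. The sketch below uses NumPy's default linear interpolation, which may differ slightly from the way the production monitoring system computes them; the function name and the synthetic latencies are made up.

```python
import numpy as np

def tp_latencies(latencies_ms):
    """Approximate TP50 / TP99 / TP999 as the 50th, 99th and 99.9th percentiles."""
    arr = np.asarray(latencies_ms, dtype=float)
    return {name: float(np.percentile(arr, q))
            for name, q in (("TP50", 50), ("TP99", 99), ("TP999", 99.9))}

# Synthetic latencies of 1..1000 ms, for illustration only.
tp = tp_latencies(range(1, 1001))
```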
Through observation of the AUC during the advertising item lifecycle over one month in both the control and experimental groups, Figure 5 shows that the experimental group exhibits a more rapid convergence in precision during the cold-start phase (first week). From a global perspective (the four scenarios), the experimental group achieved a 3.18% increase in Revenue Per Search (RPS) and a 0.96% improvement in CTR. Regarding user experience metrics, the UVCTR increased by 0.38%. Despite the significant increase in impressions in cold-start scenarios, the CTR did not decrease, and the overall UVCTR improved, effectively reducing the resistance to deployment. This also demonstrates the advantage of our approach in alleviating the seesaw phenomenon encountered when solving the cold-start issue. More importantly, the performance of the recommendation advertising system in cold-start scenarios can significantly reduce budget loss [44], which has also been verified in our practical work.

5. Conclusions

In this paper, we proposed a novel model-agnostic industrial framework (MAIF) to solve the cold-start problem in both few-shot and zero-shot scenarios. MAIF utilizes multi-task techniques, Trans Block, and loss function design to effectively avoid the “seesaw phenomenon” between cold and warm entities. This framework not only ensures efficient handling of new entities with limited or no data but also improves the prediction accuracy and calibration performance in cold-start situations. We validated MAIF through extensive offline experiments on real-world industrial datasets and an online advertising system in the DZDP app, incorporating the Auto-Selection serving mechanism. The results demonstrate significant improvements in both prediction accuracy and calibration performance, confirming the effectiveness of MAIF in addressing the challenges associated with cold-start recommendations.
Although the MAIF framework effectively mitigates cold-start problems, there are several points for future work. First, the design of cold-start topics currently depends on data profiling or the algorithm engineer’s sensitivity to the business. Automating these processes could improve efficiency. Second, the logic of selecting training samples for cold-start tasks could be refined to better adapt to dynamic data distributions. Finally, the Trans Block network structure offers room for further optimization, where more detailed designs could enhance the modeling of relationships between missing and transfer features, leading to improved performance.

Author Contributions

Conceptualization, X.C.; methodology, X.C.; software, X.C.; validation, W.Z. and F.J.; formal analysis, X.C.; investigation, X.C.; resources, X.C.; data curation, X.C.; writing—original draft preparation, X.C.; writing—review and editing, W.Z. and F.J.; visualization, X.C.; supervision, X.Z.; project administration, X.Z.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions. This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFB2103803.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available at https://doi.org/10.1145/3511808.3557220, reference number [1]. These data were derived from the following resources available in the public domain: https://kuairec.com/.

Acknowledgments

We wish to acknowledge the anonymous referees who gave precious suggestions to improve the work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gao, C.; Li, S.; Lei, W.; Chen, J.; Li, B.; Jiang, P.; He, X.; Mao, J.; Chua, T.S. KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 540–550.
  2. Lee, H.; Im, J.; Jang, S.; Cho, H.; Chung, S. Melu: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1073–1082.
  3. Guan, R.; Pang, H.; Giunchiglia, F.; Liang, Y.; Feng, X. Cross-Domain Meta-Learner for Cold-Start Recommendation. IEEE Trans. Knowl. Data Eng. 2022, 35, 7829–7843.
  4. Sánchez-Moreno, D.; López Batista, V.F.; Muñoz Vicente, M.D.; Sánchez Lázaro, Á.L.; Moreno-García, M.N. Social network community detection to deal with gray-sheep and cold-start problems in music recommender systems. Information 2024, 15, 138.
  5. Chen, H.; Wang, Z.; Huang, F.; Huang, X.; Xu, Y.; Lin, Y.; He, P.; Li, Z. Generative adversarial framework for cold-start item recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 2565–2571.
  6. Niu, C.; Wu, F.; Tang, S.; Hua, L.; Jia, R.; Lv, C.; Wu, Z.; Chen, G. Billion-scale federated learning on mobile clients: A submodel design with tunable privacy. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, London, UK, 21–25 September 2020; pp. 1–14.
  7. Volkovs, M.; Yu, G.; Poutanen, T. Dropoutnet: Addressing cold start in recommender systems. Adv. Neural Inf. Process. Syst. 2017, 30, 4964–4973.
  8. Zheng, J.; Ma, Q.; Gu, H.; Zheng, Z. Multi-view denoising graph auto-encoders on heterogeneous information networks for cold-start recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 2338–2348.
  9. Antoniak, M.; Mimno, D. Evaluating the stability of embedding-based word similarities. Trans. Assoc. Comput. Linguist. 2018, 6, 107–119.
  10. Briand, L.; Salha-Galvan, G.; Bendada, W.; Morlon, M.; Tran, V.A. A semi-personalized system for user cold start recommendation on music streaming apps. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 2601–2609.
  11. Zhao, W.X.; Li, S.; He, Y.; Chang, E.Y.; Wen, J.R.; Li, X. Connecting social media to e-commerce: Cold-start product recommendation using microblogging information. IEEE Trans. Knowl. Data Eng. 2015, 28, 1147–1159.
  12. Herce-Zelaya, J.; Porcel, C.; Tejeda-Lorente, Á.; Bernabé-Moreno, J.; Herrera-Viedma, E. Introducing CSP dataset: A dataset optimized for the study of the cold start problem in recommender systems. Information 2022, 14, 19.
  13. Zheng, X.; Liu, W.; Chen, C.; Su, J.; Liao, X.; Hu, M.; Tan, Y. Mining User Consistent and Robust Preference for Unified Cross Domain Recommendation. IEEE Trans. Knowl. Data Eng. 2024, 36, 8758–8772.
  14. Zhu, Y.; Xie, R.; Zhuang, F.; Ge, K.; Sun, Y.; Zhang, X.; Lin, L.; Cao, J. Learning to warm up cold item embeddings for cold-start recommendation with meta scaling and shifting networks. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Montréal, QC, Canada, 11–15 July 2021; pp. 1167–1176.
  15. Haldar, M.; Ramanathan, P.; Sax, T.; Abdool, M.; Zhang, L.; Mansawala, A.; Yang, S.; Turnbull, B.; Liao, J. Improving deep learning for airbnb search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 2822–2830.
  16. Wu, T.; Chio, E.K.I.; Cheng, H.T.; Du, Y.; Rendle, S.; Kuzmin, D.; Agarwal, R.; Zhang, L.; Anderson, J.; Singh, S.; et al. Zero-shot heterogeneous transfer learning from recommender systems to cold-start search retrieval. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; pp. 2821–2828.
  17. Huan, Z.; Zhang, G.; Zhang, X.; Zhou, J.; Wu, Q.; Gu, L.; Gu, J.; He, Y.; Zhu, Y.; Mo, L. An Industrial Framework for Cold-Start Recommendation in Zero-Shot Scenarios. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 3403–3407.
  18. Le, N.L.; Abel, M.H.; Gouspillou, P. Combining Embedding-Based and Semantic-Based Models for Post-Hoc Explanations in Recommender Systems. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 4619–4624.
  19. Huang, J.T.; Sharma, A.; Sun, S.; Xia, L.; Zhang, D.; Pronin, P.; Padmanabhan, J.; Ottaviano, G.; Yang, L. Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 2553–2561.
  20. Chen, Y.; Huzhang, G.; Zeng, A.; Yu, Q.; Sun, H.; Li, H.Y.; Li, J.; Ni, Y.; Yu, H.; Zhou, Z. Clustered Embedding Learning for Recommender Systems. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1074–1084.
  21. Mo, K.; Liu, B.; Xiao, L.; Li, Y.; Jiang, J. Image feature learning for cold start problem in display advertising. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
  22. Du, Y.; Zhu, X.; Chen, L.; Fang, Z.; Gao, Y. Metakg: Meta-learning on knowledge graph for cold-start recommendation. IEEE Trans. Knowl. Data Eng. 2022, 35, 9850–9863.
  23. Kuznetsov, S.; Kordík, P. Overcoming the cold-start problem in recommendation systems with ontologies and knowledge graphs. In Proceedings of the European Conference on Advances in Databases and Information Systems, Barcelona, Spain, 4–7 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 591–603.
  24. Li, J.; Chiu, B.; Feng, S.; Wang, H. Few-shot named entity recognition via meta-learning. IEEE Trans. Knowl. Data Eng. 2020, 34, 4245–4256.
  25. Pang, H.; Giunchiglia, F.; Li, X.; Guan, R.; Feng, X. PNMTA: A pretrained network modulation and task adaptation approach for user cold-start recommendation. In Proceedings of the ACM Web Conference 2022, Barcelona, Spain, 26–29 June 2022; pp. 348–359.
  26. Pan, F.; Li, S.; Ao, X.; Tang, P.; He, Q. Warm up cold-start advertisements: Improving ctr predictions via learning to learn id embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 695–704.
  27. Lu, Y.; Fang, Y.; Shi, C. Meta-learning on heterogeneous information networks for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 1563–1573.
  28. Dong, M.; Yuan, F.; Yao, L.; Xu, X.; Zhu, L. Mamo: Memory-augmented meta-optimization for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 688–697.
  29. Felício, C.Z.; Paixão, K.V.; Barcelos, C.A.; Preux, P. A multi-armed bandit model selection for cold-start user recommendation. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, Bratislava, Slovakia, 9–12 July 2017; pp. 32–40.
  30. Wang, Q.; Zeng, C.; Zhou, W.; Li, T.; Iyengar, S.S.; Shwartz, L.; Grabarnik, G.Y. Online interactive collaborative filtering using multi-armed bandit with dependent arms. IEEE Trans. Knowl. Data Eng. 2018, 31, 1569–1580.
  31. Fu, M.; Huang, L.; Rao, A.; Irissappane, A.A.; Zhang, J.; Qu, H. A deep reinforcement learning recommender system with multiple policies for recommendations. IEEE Trans. Ind. Inform. 2022, 19, 2049–2061.
  32. Van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 2011, 45, 1–67.
  33. Feng, P.J.; Pan, P.; Zhou, T.; Chen, H.; Luo, C. Zero shot on the cold-start problem: Model-agnostic interest learning for recommender systems. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 474–483.
  34. Duan, H.; Meng, X.; Tang, J.; Qiao, J. Dynamic System Modeling Using a Multisource Transfer Learning-Based Modular Neural Network for Industrial Application. IEEE Trans. Ind. Inform. 2024, 20, 7173–7182.
  35. Wang, Z.; Yang, Y.; Wu, L.; Hong, R.; Wang, M. Making Non-overlapping Matters: An Unsupervised Alignment enhanced Cross-Domain Cold-Start Recommendation. IEEE Trans. Knowl. Data Eng. 2024, 37, 2001–2014.
  36. Harper, F.M.; Konstan, J.A. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst. 2015, 5, 1–19.
  37. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386.
  38. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI, Melbourne, Australia, 19–25 August 2017; pp. 1725–1731.
  39. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10.
  40. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
  41. Yan, L.; Li, W.J.; Xue, G.R.; Han, D. Coupled group lasso for web-scale ctr prediction in display advertising. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; PMLR: Cambridge, MA, USA, 2014; pp. 802–810.
  42. Tang, P.; Wang, X.; Wang, Z.; Xu, Y.; Yang, X. Optimized cost per mille in feeds advertising. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, Auckland, New Zealand, 9–13 May 2020; pp. 1359–1367.
  43. Sheng, X.R.; Zhao, L.; Zhou, G.; Ding, X.; Dai, B.; Luo, Q.; Yang, S.; Lv, J.; Zhang, C.; Deng, H.; et al. One model to serve all: Star topology adaptive recommender for multi-domain ctr prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 4104–4113.
  44. Ye, Z.; Zhang, D.J.; Zhang, H.; Zhang, R.; Chen, X.; Xu, Z. Cold start to improve market thickness on online advertising platforms: Data-driven algorithms and field experiments. Manag. Sci. 2023, 69, 3838–3860.
Figure 1. An example of the overall framework with two cold-start topics, each corresponding to a missing feature: UserID and ItemID, respectively. We use multi-task learning to divide the model into two tasks associated with different training samples, labeled (a,b). The base model task (a) remains consistent with the original model structure. Both tasks share the parameters θ of their upper dense layers, except for the Trans Blocks. The two Trans Blocks illustrated in (c) map the transfer features into the semantic space of the missing feature; their parameters ϕ are updated only by $L_{\text{auxiliary}}$ and $L_{\text{cold}}$.
Figure 2. Illustration of the four scenarios on the KuaiRec dataset: one warm-up scenario and three cold-start scenarios, namely (1) existing items to new users (upper right), (2) new items to existing users (lower left), and (3) new items to new users (upper left). Scenarios (1) and (2) are few-shot, whereas scenario (3) is zero-shot.
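The four-way split described in the Figure 2 caption can be sketched as a lookup on whether the user and the item already appear in the training interactions. This is a minimal illustration only: the function name `classify_scenario` and the set-based bookkeeping are assumptions made for the example, not the authors' implementation.

```python
def classify_scenario(user_id, item_id, known_users, known_items):
    """Return the evaluation scenario for one (user, item) pair.

    known_users / known_items are hypothetical sets of IDs that have
    interaction history in the training data.
    """
    user_is_new = user_id not in known_users
    item_is_new = item_id not in known_items
    if user_is_new and item_is_new:
        # Scenario (3): new items to new users.
        return "user-item cold-start (zero-shot)"
    if user_is_new:
        # Scenario (1): existing items to new users.
        return "user cold-start (few-shot)"
    if item_is_new:
        # Scenario (2): new items to existing users.
        return "item cold-start (few-shot)"
    return "warm-up (fully observed)"


# Toy IDs, chosen only for illustration.
known_users = {"u1", "u2"}
known_items = {"v1", "v2"}
print(classify_scenario("u9", "v1", known_users, known_items))
```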
Figure 3. Distribution of average click number and CTR across frequency percentiles on the KuaiRec (a) and MovieLens-1M (b) datasets. The bars represent the average click count per item on a logarithmic scale, and the line indicates the CTR for each percentile.
Figure 4. Performance of MAIF (Wide & Deep) under varying Trans Block activation thresholds τ (percentiles of item click counts) on the KuaiRec (a) and MovieLens-1M (b) datasets, with 95% confidence intervals over 5 independent runs.
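To make the role of the threshold τ concrete, the sketch below turns a percentile of item click counts into a click-count cutoff and uses it to decide whether an item is routed through the Trans Block. The nearest-rank percentile rule, the "at or below the cutoff" routing, and the names `click_cutoff` and `use_trans_block` are all assumptions for illustration; the caption only states that τ is a percentile of item click counts.

```python
import math


def click_cutoff(click_counts, tau):
    """Nearest-rank percentile: the smallest click count such that at
    least tau percent of items have that many clicks or fewer."""
    if not 0 < tau <= 100:
        raise ValueError("tau must be in (0, 100]")
    ordered = sorted(click_counts)
    rank = math.ceil(tau * len(ordered) / 100)
    return ordered[rank - 1]


def use_trans_block(item_clicks, cutoff):
    """Assumed routing rule: items at or below the cutoff are treated as
    cold and served with the Trans Block output instead of the ID embedding."""
    return item_clicks <= cutoff


# Toy click counts for ten items (not taken from either dataset).
counts = [0, 1, 2, 5, 10, 50, 100, 500, 1000, 5000]
cutoff = click_cutoff(counts, tau=30)
cold_items = [c for c in counts if use_trans_block(c, cutoff)]
print(cutoff, cold_items)
```

A larger τ sends more items through the Trans Block, which is the trade-off Figure 4 sweeps over.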
Figure 5. The AUC of the advertising item lifecycle over one month in both the control and experimental groups.
Table 1. Mapping between cold-start topics, missing features, and transfer features for the KuaiRec and MovieLens-1M datasets.

| Dataset | Cold-Start Topic | Missing Feature | Transfer Feature |
|---|---|---|---|
| KuaiRec | user topic | user_id | user_active_degree, is_live_streamer, is_video_author, register_days, onehot_feat0-17 |
| KuaiRec | item topic | video_id | video_type, date, upload_type, video_width, video_height, video_tag_id, video_tag_name, show_cnt, play_cnt, valid_play_cnt, like_cnt, comment_cnt, follow_cnt, share_cnt, collect_cnt |
| MovieLens-1M | user topic | userid | gender, age, occupation, zip-code |
| MovieLens-1M | item topic | movieid | title, year of release, genres |

Any fields not listed are considered common features. Labels are binarized: $y = 1$ when $\text{watch\_ratio} \geq 2$ (KuaiRec) or $\text{ratings} \geq 4$ (MovieLens-1M), and $y = 0$ otherwise.
Table 2. The offline experimental results on the KuaiRec dataset in four scenarios.

| Scenario | Model | AUC | RelaImp | PCOC |
|---|---|---|---|---|
| Warm-Up (fully observed) | MLP | 0.9334 ± 0.0017 | 0.0% | 1.0312 ± 0.010 |
| | DeepFM | 0.9436 ± 0.0016 | 2.35% | 1.0539 ± 0.015 |
| | Wide & Deep | 0.9478 ± 0.0013 | 3.32% | 1.0220 ± 0.018 |
| | Dropout-Net | 0.9134 ± 0.0019 | −4.61% | 1.0456 ± 0.016 |
| | MICE (MLP) | 0.9333 ± 0.0017 | −0.023% | 1.0333 ± 0.011 |
| | MeLU (MLP) | 0.9303 ± 0.0016 | −0.72% | 1.0512 ± 0.016 |
| | MEG (MLP) | 0.9310 ± 0.0016 | −0.55% | 1.0669 ± 0.014 |
| | MAIF (MLP) | 0.9331 ± 0.0017 | −0.07% | 1.0307 ± 0.012 |
| | MAIF (DeepFM) | 0.9436 ± 0.0016 | 2.35% | 1.0824 ± 0.019 |
| | MAIF (Wide & Deep) | 0.9477 ± 0.0015 | 3.30% | 1.0422 ± 0.013 |
| User Cold-Start (few-shot) | MLP | 0.8615 ± 0.0030 | 0.0% | 1.2104 ± 0.054 |
| | DeepFM | 0.8767 ± 0.0029 | 4.20% | 1.2380 ± 0.059 |
| | Wide & Deep | 0.8988 ± 0.0031 | 10.32% | 1.2616 ± 0.056 |
| | Dropout-Net | 0.8702 ± 0.0033 | 2.41% | 1.1113 ± 0.020 |
| | MICE (MLP) | 0.8723 ± 0.0035 | 2.98% | 1.1321 ± 0.021 |
| | MeLU (MLP) | 0.8815 ± 0.0029 | 5.53% | 1.0913 ± 0.024 |
| | MEG (MLP) | 0.8988 ± 0.0026 | 10.32% | 1.0867 ± 0.021 |
| | MAIF (MLP) | 0.9002 ± 0.0028 | 10.71% | * 1.0689 ± 0.014 |
| | MAIF (DeepFM) | * 0.9015 ± 0.0029 | 11.07% | * 1.0754 ± 0.017 |
| | MAIF (Wide & Deep) | * 0.9106 ± 0.0028 | 13.58% | * 1.0898 ± 0.018 |
| | MAIF (Wide & Deep) w/o reused | * 0.8447 ± 0.0025 | −4.65% | * 1.0873 ± 0.017 |
| | MAIF (Wide & Deep) w/o $L_{\text{auxiliary}}$ | * 0.8965 ± 0.0026 | 9.68% | * 1.1004 ± 0.019 |
| Item Cold-Start (few-shot) | MLP | 0.7040 ± 0.0030 | 0.0% | 1.7020 ± 0.18 |
| | DeepFM | 0.7195 ± 0.0031 | 7.6% | 1.8175 ± 0.20 |
| | Wide & Deep | 0.7265 ± 0.0028 | 11.03% | 1.6987 ± 0.15 |
| | Dropout-Net | 0.7123 ± 0.0032 | 4.07% | 1.2031 ± 0.020 |
| | MICE (MLP) | 0.7198 ± 0.0035 | 7.74% | 1.3321 ± 0.051 |
| | MeLU (MLP) | 0.7388 ± 0.0031 | 17.06% | 1.2003 ± 0.018 |
| | MEG (MLP) | 0.7425 ± 0.0030 | 18.87% | 1.1775 ± 0.022 |
| | MAIF (MLP) | * 0.7490 ± 0.0029 | 22.06% | * 1.1249 ± 0.015 |
| | MAIF (DeepFM) | * 0.7528 ± 0.0025 | 23.92% | * 1.1182 ± 0.016 |
| | MAIF (Wide & Deep) | * 0.7592 ± 0.0027 | 27.06% | * 1.1307 ± 0.015 |
| | MAIF (Wide & Deep) w/o reused | * 0.6783 ± 0.0024 | −12.59% | * 1.1024 ± 0.014 |
| | MAIF (Wide & Deep) w/o $L_{\text{auxiliary}}$ | * 0.7394 ± 0.0025 | 17.35% | * 1.1507 ± 0.017 |
| User–Item Cold-Start (zero-shot) | MLP | 0.8244 ± 0.0027 | 0.0% | 1.2800 ± 0.064 |
| | DeepFM | 0.8204 ± 0.0022 | −1.23% | 1.2473 ± 0.058 |
| | Wide & Deep | 0.8268 ± 0.0032 | 0.74% | 1.2437 ± 0.060 |
| | Dropout-Net | 0.8275 ± 0.0033 | 0.96% | 1.1556 ± 0.023 |
| | MICE (MLP) | 0.8366 ± 0.0035 | 3.76% | 1.1844 ± 0.031 |
| | MeLU (MLP) | 0.8444 ± 0.0037 | 6.17% | 1.1881 ± 0.025 |
| | MEG (MLP) | 0.8504 ± 0.0039 | 8.01% | 1.1773 ± 0.026 |
| | MAIF (MLP) | * 0.8655 ± 0.0037 | 12.67% | * 1.0812 ± 0.023 |
| | MAIF (DeepFM) | * 0.8681 ± 0.0037 | 13.47% | * 1.0992 ± 0.021 |
| | MAIF (Wide & Deep) | * 0.8683 ± 0.0032 | 13.53% | * 1.0923 ± 0.020 |
| | MAIF (Wide & Deep) w/o reused | * 0.8064 ± 0.0025 | −5.55% | * 1.0996 ± 0.018 |
| | MAIF (Wide & Deep) w/o $L_{\text{auxiliary}}$ | * 0.8534 ± 0.0033 | 8.94% | * 1.1047 ± 0.023 |

* Results are reported as mean ± 95% confidence intervals over 5 independent runs. The symbol * indicates statistical significance ( p < 0.05 ) compared with the best baseline, based on a two-sample t-test. Best results are shown in bold.
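For readers who want to recompute the derived columns, the sketch below uses the common definition of relative AUC improvement, which rescales AUC against the 0.5 of a random predictor, and the usual reading of PCOC as total predicted clicks over total observed clicks. Both formulas are assumptions in the sense that this section does not restate them; the RelaImp formula does reproduce the reported values when MLP is taken as the base model. The function names are illustrative.

```python
def rela_imp(auc_model, auc_base):
    """Relative AUC improvement in percent over a base model.

    0.5 is subtracted because it is the AUC of a random predictor.
    """
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0


def pcoc(predicted_ctrs, clicks):
    """Predicted clicks over observed clicks; 1.0 means the model is
    calibrated on average, values above 1.0 mean over-prediction."""
    return sum(predicted_ctrs) / sum(clicks)


# Warm-up scenario on KuaiRec: DeepFM against the MLP base model.
print(round(rela_imp(0.9436, 0.9334), 2))

# Toy predictions and labels, for illustration only.
print(pcoc([0.2, 0.4, 0.6], [0, 1, 0]))
```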
Table 3. The offline experimental results on the MovieLens-1M dataset in four scenarios.

| Scenario | Model | AUC | RelaImp | PCOC |
|---|---|---|---|---|
| Warm-Up (fully observed) | MLP | 0.7216 ± 0.0012 | 0.0% | 1.0340 ± 0.007 |
| | DeepFM | 0.7268 ± 0.0013 | 2.34% | 1.0371 ± 0.009 |
| | Wide & Deep | 0.7279 ± 0.0015 | 2.84% | 1.0386 ± 0.009 |
| | Dropout-Net | 0.7151 ± 0.0017 | −2.93% | 1.0301 ± 0.007 |
| | MICE (MLP) | 0.7215 ± 0.0018 | −0.04% | 1.0548 ± 0.011 |
| | MeLU (MLP) | 0.7151 ± 0.0012 | −2.93% | 1.0695 ± 0.007 |
| | MEG (MLP) | 0.7166 ± 0.0013 | −2.25% | 1.0719 ± 0.008 |
| | MAIF (MLP) | 0.7215 ± 0.0015 | −0.04% | 1.0320 ± 0.008 |
| | MAIF (DeepFM) | 0.7269 ± 0.0015 | 2.39% | 1.0384 ± 0.007 |
| | MAIF (Wide & Deep) | 0.7279 ± 0.0013 | 2.84% | 1.0382 ± 0.009 |
| User Cold-Start (few-shot) | MLP | 0.6602 ± 0.0022 | 0.0% | 1.2348 ± 0.060 |
| | DeepFM | 0.6629 ± 0.0021 | 1.68% | 1.2595 ± 0.055 |
| | Wide & Deep | 0.6681 ± 0.0021 | 4.93% | 1.2712 ± 0.065 |
| | Dropout-Net | 0.6544 ± 0.0023 | −3.62% | 1.1522 ± 0.032 |
| | MICE (MLP) | 0.6741 ± 0.0025 | 8.67% | 1.1773 ± 0.036 |
| | MeLU (MLP) | 0.6710 ± 0.0020 | 6.74% | 1.0765 ± 0.013 |
| | MEG (MLP) | 0.6747 ± 0.0020 | 9.05% | 1.0868 ± 0.015 |
| | MAIF (MLP) | * 0.6956 ± 0.0019 | 22.09% | * 1.0524 ± 0.010 |
| | MAIF (DeepFM) | * 0.6994 ± 0.0018 | 24.46% | * 1.0673 ± 0.013 |
| | MAIF (Wide & Deep) | * 0.7013 ± 0.0019 | 25.65% | * 1.0529 ± 0.011 |
| | MAIF (Wide & Deep) w/o reused | * 0.6412 ± 0.0018 | −11.86% | * 1.0531 ± 0.011 |
| | MAIF (Wide & Deep) w/o $L_{\text{auxiliary}}$ | * 0.6887 ± 0.0019 | 17.79% | * 1.0732 ± 0.013 |
| Item Cold-Start (few-shot) | MLP | 0.6337 ± 0.0024 | 0.0% | 1.3974 ± 0.092 |
| | DeepFM | 0.6470 ± 0.0024 | 9.94% | 1.3647 ± 0.085 |
| | Wide & Deep | 0.6482 ± 0.0022 | 10.84% | 1.4017 ± 0.081 |
| | Dropout-Net | 0.6198 ± 0.0023 | −10.39% | 1.2019 ± 0.031 |
| | MICE (MLP) | 0.6540 ± 0.0022 | 15.18% | 1.2863 ± 0.037 |
| | MeLU (MLP) | 0.6627 ± 0.0020 | 21.69% | 1.1872 ± 0.028 |
| | MEG (MLP) | 0.6672 ± 0.0020 | 25.05% | 1.1510 ± 0.025 |
| | MAIF (MLP) | * 0.6774 ± 0.0021 | 32.65% | * 1.0770 ± 0.018 |
| | MAIF (DeepFM) | * 0.6856 ± 0.0020 | 38.81% | * 1.0695 ± 0.015 |
| | MAIF (Wide & Deep) | * 0.6885 ± 0.0021 | 40.98% | * 1.0691 ± 0.015 |
| | MAIF (Wide & Deep) w/o reused | * 0.6178 ± 0.0019 | −11.89% | * 1.0744 ± 0.014 |
| | MAIF (Wide & Deep) w/o $L_{\text{auxiliary}}$ | * 0.6782 ± 0.0022 | 33.28% | * 1.0952 ± 0.018 |
| User–Item Cold-Start (zero-shot) | MLP | 0.6444 ± 0.0019 | 0.0% | 1.317 ± 0.072 |
| | DeepFM | 0.6481 ± 0.0020 | 2.56% | 1.3228 ± 0.078 |
| | Wide & Deep | 0.6515 ± 0.0020 | 4.91% | 1.3614 ± 0.069 |
| | Dropout-Net | 0.6313 ± 0.0022 | −9.07% | 1.1407 ± 0.025 |
| | MICE (MLP) | 0.6616 ± 0.0021 | 11.91% | 1.2106 ± 0.035 |
| | MeLU (MLP) | 0.6583 ± 0.0022 | 9.62% | 1.1196 ± 0.015 |
| | MEG (MLP) | 0.6682 ± 0.0021 | 16.48% | 1.0964 ± 0.012 |
| | MAIF (MLP) | * 0.6749 ± 0.0020 | 21.12% | * 1.0624 ± 0.016 |
| | MAIF (DeepFM) | * 0.6773 ± 0.0019 | 22.78% | * 1.0632 ± 0.015 |
| | MAIF (Wide & Deep) | * 0.6802 ± 0.0021 | 24.79% | * 1.0585 ± 0.017 |
| | MAIF (Wide & Deep) w/o reused | * 0.6272 ± 0.0018 | −11.91% | * 1.0663 ± 0.012 |
| | MAIF (Wide & Deep) w/o $L_{\text{auxiliary}}$ | * 0.6661 ± 0.0021 | 15.03% | * 1.0789 ± 0.019 |

* Results are reported as mean ± 95% confidence intervals over 5 independent runs. The symbol * indicates statistical significance ( p < 0.05 ) compared with the best baseline, based on a two-sample t-test. Best results are shown in bold.
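The "mean ± 95% confidence interval over 5 independent runs" format can be reproduced with a Student's t interval, sketched below. This is one standard way to build such an interval and is an assumption here, since the section does not say how the intervals were computed. The AUC values in the example are hypothetical and are not taken from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom,
# which applies to a sample of exactly 5 runs.
T_CRIT_95_DF4 = 2.7764


def mean_with_ci95(runs):
    """Return (mean, half_width) of a 95% t-interval over 5 runs."""
    if len(runs) != 5:
        raise ValueError("the critical value above assumes exactly 5 runs")
    mean = statistics.mean(runs)
    # Standard error of the mean uses the sample standard deviation.
    std_err = statistics.stdev(runs) / math.sqrt(len(runs))
    return mean, T_CRIT_95_DF4 * std_err


# Hypothetical AUC values from five independent runs.
auc_runs = [0.70, 0.71, 0.69, 0.70, 0.70]
mean, half_width = mean_with_ci95(auc_runs)
print(f"{mean:.4f} ± {half_width:.4f}")
```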
Table 4. Online A/B test result for three cold-start scenarios.

| Model | AUC | RelaImp | PCOC | Impression | Click | CTR |
|---|---|---|---|---|---|---|
| Control | 0.7928 | 0.0% | 0.9229 | 1.0 | 1.0 | 0.01739 |
| Experimental | * 0.8157 | * 7.8% | * 1.0415 | * 1.203 | * 1.2205 | * 0.01763 |

* The online A/B test result for three cold-start scenarios in the advertising system of DZDP. Best results are shown in bold, and the symbol * indicates statistical significance ( p < 0.05 ) based on a two-sample t-test. Note that the impression and click metrics have been anonymized.
Table 5. Online A/B test for prediction latency and failure rate.

| Model | TP50 (ms) | TP99 (ms) | TP999 (ms) | Failure (%) |
|---|---|---|---|---|
| Control | 17.0 | 23.3 | 38.5 | 0.121 |
| Experimental | 19.2 | 24.3 | 38.5 | 0.134 |

Values represent average monitoring statistics over the online A/B testing period.
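TP50, TP99, and TP999 denote the 50th, 99th, and 99.9th percentiles of request latency. As a rough sketch of how such figures could be derived from raw latency samples, the example below uses Python's `statistics.quantiles`; the helper name `tail_latencies`, the inclusive interpolation method, and the synthetic samples are assumptions for illustration and do not describe the production monitoring system.

```python
import statistics


def tail_latencies(latencies_ms):
    """Return TP50, TP99, and TP999 for a list of latency samples (ms)."""
    # n=1000 yields 999 cut points; cut point i (1-based) is the i/1000 quantile.
    cuts = statistics.quantiles(latencies_ms, n=1000, method="inclusive")
    return {"TP50": cuts[499], "TP99": cuts[989], "TP999": cuts[998]}


# Synthetic samples 0..1000 ms, used only to make the percentiles easy to read.
samples = list(range(1001))
print(tail_latencies(samples))
```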
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Cao, X.; Zhang, W.; Jiang, F.; Zhang, X. An Industrial Framework for Cold-Start Recommendation in Few-Shot and Zero-Shot Scenarios. Information 2025, 16, 1105. https://doi.org/10.3390/info16121105
