Diffusion Model as a Base for Cold Item Recommendation

Han, Jungkyu; Chun, Sejin

doi:10.3390/app15094784

Open AccessArticle

Diffusion Model as a Base for Cold Item Recommendation

by

Jungkyu Han

and

Sejin Chun

^*

Department of Computer Engineering, Dong-A University, Busan 49315, Republic of Korea

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 4784; https://doi.org/10.3390/app15094784

Submission received: 31 March 2025 / Revised: 23 April 2025 / Accepted: 24 April 2025 / Published: 25 April 2025

(This article belongs to the Special Issue AI-Supported Decision Making and Recommender Systems)

Download

Browse Figures

Versions Notes

Abstract

Cold items are a critical problem in the recommendation domain because newly introduced items lack user–item interactions to train accurate collaborative filters (CFs). Recent studies have adopted neural networks such as MLPs and autoencoders to predict collaborative embeddings learned by CFs, using items’ side information available at the time of registration. As a generative model, diffusion models have achieved success in various fields, such as image generation and natural language processing, through their superior generative capability. This paper proposes a diffusion model as a base for cold item recommendation by generating collaborative embeddings for cold items. First, using a diffusion model with our customized predictor, we directly generate items’ collaborative embeddings conditioned on their side information. Then, a second-stage refiner, which adopts simple MLPs and dropout, is trained to calculate the final recommendation scores. The proposed method requires only the user embeddings from the existing model and the side information of cold items, making it easy to integrate into existing recommender systems. Extensive experiments on real-world datasets show that the proposed method outperforms state-of-the-art baseline methods and indicates the potential of diffusion models in cold item recommendation.

Keywords:

cold item recommendation; diffusion model; recommender system

1. Introduction

Collaborative filtering (CF) [1], a widely used approach for personalized recommendation, suffers from the cold item problem when newly introduced items are added to recommender systems [2,3,4]. “Cold items” refer to items that lack sufficient user interactions. Because CF methods rely on user–item interaction data to make accurate recommendations, their performance significantly degrades when recommending cold items. Extensive studies have been conducted on the cold item problem, as it is essential for ensuring users can discover relevant and diverse content in a timely manner, while also helping item providers gain early exposure for newly introduced items. One group of studies [5,6,7,8,9,10,11,12] studied the methods for fast warm-up, which achieves accurate recommendations using as little interaction data as possible. Another group [13,14,15,16,17,18,19,20,21,22] is interested in improving start-time recommendation accuracy when no interaction data are available.

To improve the start-time accuracy of cold item recommendation, content-based filters [23] that exploit the similarity of side information between candidate items and the items interacted with by a target user were studied in the early era of personalized recommendation. Side information refers to metadata that describe items—such as tags, texts, or images—typically provided by item creators or sellers. It offers auxiliary signals that are strongly correlated with the latent properties of the items. Therefore, it plays a pivotal role in user interactions because it is often the first and only information that users encounter before deciding interactions with items. As a result, items with similar side information tend to elicit similar interaction patterns from users. Furthermore, side information is readily available at the time of item registration, as item providers are generally motivated to include detailed descriptions to attract user attention. Therefore, in the absence of historical interaction data, side information becomes the most critical factor for inferring the interaction patterns of new items. Although the traditional content-based filters mitigated the problem to some extent, there is still room for improvement.

In the last decade, several theoretical foundations and methods that can be applicable to cold item recommendations, such as compressed sensing [24] and pairwise comparisons [25], have been proposed. Among them, methods using neural networks have been studied the most. The studies in [13,14,15,16,17,18,19,20,21,22] tried to “generate” items’ collaborative embeddings, which are originally learned by CFs from items’ side information. These works explored various effective network architectures to achieve their goal. Specifically, these include convolutional neural networks [17] and autoencoders [13,14,15,16], and parameter sharing between user and item predictors [18], techniques to prevent overfitting [19,20], and the design of effective objective functions [20,21,22] are also investigated.

The diffusion model (DM) [26,27] is a powerful generative model which has shown prominent results in various domains [28,29,30,31,32]. Unlike traditional generative models like GANs and VAEs, which face stability and representation limitations, diffusion models employ a forward-noising and reverse-denoising process, aligning well with the nature of recommendation tasks where user interactions often contain noise and require reconstruction. Because of their superior generative power and denoising performance, researchers have started investigating effective ways to apply DMs in recommender systems. For instance, DMs were adopted to recommend items from the user’s interaction vector [33,34,35,36,37], denoising the collaborative embeddings learned by base recommender models [38,39], and incorporating multi-modal knowledge [40] to improve general recommendation performance. DMs are also applied to augment data for the next-item [41,42,43] and cross-domain recommendation settings [44].

DMs are inherently well suited for improving the start-time accuracy of cold items, as they control the generation process based on given conditions. By conditioning on side information, DMs can generate more “natural” collaborative embeddings for cold items. However, only a few studies [44,45] have applied DMs to cold item recommendation, and these are not directly focused on improving start-time recommendation accuracy. Further investigation is needed to explore how DMs can be effectively applied in this domain. To this end, we propose a method that adopts a DM as the foundation for cold item recommendation, leveraging its generative capability without modifying existing CF-based recommender systems.

Our proposed method consists of two modules. In the first module, we use a DM [46] to directly generate items’ collaborative embeddings, conditioned on their side information. To more accurately generate collaborative embeddings of cold items, we adopt guidance information calculated from a mixture-of-expert architecture in the predictor of the DM. Then, the generated item collaborative embeddings, along with the user embeddings calculated by the base recommender, are passed to the second module, which refines the embeddings to enable more accurate recommendations. We train the second module using the same interaction data from the training set that was used to train the base recommender.

Our contributions are three-fold:

To the best of our knowledge, our work is the first to evaluate the potential of DMs in improving the start-time recommendation accuracy of cold item recommendations.
We propose a simple add-on method that is easy to implement compared to more complex models, and easily applicable to existing base recommender systems without modifications.
Through extensive evaluations on real-world datasets, we demonstrated that the proposed method outperforms state-of-the-art approaches. In addition, a hyperparameter sensitivity analysis and an ablation study show how DM hyperparameters affect recommendation performance and highlight the relationship between the DM module and the refinement module. We believe these findings offer valuable insights for future research.

The rest of this paper consists of five sections. Section 2 provides a comprehensive overview of existing research relevant to this study, and contrasts the differences between prior work and the proposed method. Section 3 describes the proposed method in detail, outlining the underlying concepts and explain details for implementation. Section 4 gives the evaluation details including the dataset, evaluation metrics, and comparison methods. The evaluation and analysis of the proposed method, along with a discussion of the results, are presented in Section 5. Finally, Section 6 concludes the paper by summarizing the key findings and suggesting directions for future work.

2. Related Work

This section reviews the studies related to the proposed method. Since the proposed method adopts a DM for cold item recommendation, we first review works adopting general neural network architectures for cold item recommendation. We then discuss works adopting DMs to recommender systems.

2.1. Cold Item Recommendation

Studies on cold-start recommendation (for both users and items) can be broadly classified into two categories. The first group of studies [5,6,7,8,9,10,11,12] focuses on fast warm-up, which aims to achieve accurate recommendations using minimal interaction data. To achieve their goal, they adopted meta-learning [5,6]. Some studied the fast warm-up of cold users by training per-user recommenders [7], adopting memory structures [8], or using task-adaptive mechanisms [9] to effectively re-use information for fast adaptation. Other studies [10,11,12,47] studied cold item warm-up. Pan et al. [10] learned item embeddings that encode the interaction characteristics of items by meta-learning. Lu et al. [11] adopted a Heterogeneous Information Network (HIN) [48,49] to improve the performance of meta-learning. Zhu et al. [12] updated collaborative embeddings of cold items using meta-scaling and meta-shift. Pan et al. [47] trained multiple meta-learners to exploit different types of item contents, such as review text and images, to improve the next-item recommendation performance.

Another group of studies tried to increase the start-time recommendation accuracy of cold items by using side information. Since the cold item problem arises from a lack of interaction data, knowledge propagation on graphs can be one of the solutions to augment deficient user–item interaction data. Wang et al. [50] improved cold item recommendation performance in cross-domain settings by transferring knowledge from a source domain graph with sufficient interactions and side information to a target domain graph that contains only side information. Togashi et al. [51] generated pseudo-user–item interactions based on the strength of semantic relations calculated with a knowledge graph.

As another approach, methods that tried to predict collaborative embeddings of cold items based on their side information were studied. Various neural network architectures suitable for collaborative embedding predictors were investigated to achieve accurate prediction. For music recommendation, Galvan et al. [13] integrated the gravity model into an attribute graph connecting similar artists and side musical information for graph autoencoder to suggest cold artists to users, and Oramas et al. [17] evaluated a CNN architecture for music embedding generation from side information such as track and artist information. Zhao et al. [15] trained two variational autoencoders [52], the first encoder to encode the item collaborative embeddings and the second encoder to generate the collaborative embeddings encoded by the first encoder based on the side information of cold items. Hansen et al. [16] used an autoencoder to predict collaborative embeddings, represented as binary hash codes, from the side information of cold items. Raziperchikolaei et al. [18] reduced model training time by sharing the parameters of the first layer between the item and user collaborative embedding predictors.

In addition to the effective network structure, other factors, such additional information and training techniques, are also investigated. Ouyang et al. [53] suggested a side information adaptation using the side information of the content neighbors of cold items. Volkovs et al. [19] proposed an effective training method for collaborative embedding predictors by adopting a dropout technique. Zhu et al. [20] showed that directly reducing the error of collaborative embedding predictor by explicitly defining the error in the objective function can improve cold item recommendation performance. Wei et al. [21] and Zhou et al. [22] adopted contrastive learning [54] techniques to train more accurate cold item recommender models.

2.2. Recommendation with Diffusion Models

Recent advancements in DMs have demonstrated significant potential in improving recommender systems by addressing issues such as data sparsity, noisy interactions, and user preference modeling.

As one of the pioneering works in this area, DMs have been effectively used to generate robust user–item interaction representations. Choi et al. [37] introduced a perturbation recovery paradigm to model user–item interactions, similar to score-based generative models but applied to collaborative filtering. Wang et al.’s work [33] extended traditional DMs to tackle high resource costs in large-scale item prediction and temporal shifts in user preferences. It introduced latent-space clustering and temporal weighting to enhance recommendation performance. Similarly, the Denoising Diffusion Recommender Model [27] applied DMs to denoise user and item collaborative embeddings, improving robustness against implicit feedback noise.

Graph-based approaches leveraged DMs to enhance collaborative filtering. Zhu et al.’s work [34] applied heat equation-based smoothing on an item–item similarity graph to refine user preferences. Hou et al. [35] integrated multi-hop cross-attention autoencoders to capture high-order collaborative signals, significantly improving user–item interaction modeling.

DMs were also applied to the sequential recommendation. Niu et al. [43] combined multi-scale CNNs and residual LSTMs to extract local and global user behavior patterns. They effectively improved short-term and long-term sequence modeling by conditioning the diffusion process on user interaction sequences, significantly enhancing sequential recommendation accuracy. Ma et al. [41] employed a time-interval DM to infer users’ preferences across all items, addressing the data sparsity issue. Similarly, Yang et al. [42] reshaped sequential recommendation as a learning-to-generate problem using guided diffusion, removing the reliance on negative sampling.

In multi-modal recommendation, Ma et al. [39] and Jiang et al. [40] integrated diffusion models to align multi-modal content features (e.g., text, image, and audio) with user–item interactions, leading to improved recommendation accuracy.

For the cold start problem, Wang et al. [44] integrated a variance-scheduled diffusion process with a classifier to enhance cold start multi-scenario recommendations. To improve recommendation accuracy, they effectively learned domain-specific and shared representations to balance knowledge transfer between cold start and rich data domains. However, no notable works have been published in the cold item recommendation domain that seek to improve start-time cold item recommendations.

3. Cold Item Recommender with a Diffusion Model Base

In this section, we first define the problem to be solved, then introduce the diffusion model as a preliminary model, and finally present our proposed method. Table 1 represents the notations frequently used throughout this paper.

3.1. Cold Item Recommendation

The cold item recommendation problem that we are trying to solve is defined as follows. Users in a user set U and items in an item set I make interactions such as clicks and ratings. Items in I are classified into one of the two mutually non-intersected subsets: warm item set

I_{w}

and cold item set

I_{c}

. Regardless of their subset, each item i has its side information embedding

c_{i}

. The interactions are represented as two rating matrices,

R^{t r} \in R^{| U | \times | I_{w} |}

and

R^{t e} \in R^{| U | \times | I_{c} |}

, whose elements indicate interactions between users and warm items and between users and cold items, respectively. Each user

u \in U

and warm item

i \in I_{w}

has its own collaborative embedding

e_{u}

and

e_{i}

, which is calculated by a base CF method by using

R^{t r}

as the training data.

Our goal is, for a given

u \in U

and cold item set

I_{c}

, to calculate an ordered list of k cold items

L_{u}^{k}

that maximizes the prediction accuracy about u’s interaction in

R^{t e}

. To calculate

L_{u}^{k}

, we can only use

e_{u}

, and

c_{i}

of cold items as input. In the training, we exploit the information from

e_{u}

and

e_{i}

, collaborative embeddings of users and warm items,

R^{t r}

, and

c_{i}

of all items.

3.2. Preliminary: Diffusion Model

DM generates realistic data that satisfy a given condition (requirement, e.g., generate an image from its description represented by text embedding) from pure Gaussian noise. DM consists of a forward process that adds noise to the original data and a backward process that generates the original data from noise.

Forward Process: In the forward process, Gaussian noises are added T times to given original data

x_{0}

to obtain a noised data sequence

x_{1}

,

x_{2}

, …,

x_{T}

. Noised data

x_{t}

at timestep t are sampled from

x_{t - 1}

using Equation (1).

q (x_{t} | x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I)

(1)

where

N

, I, and

β_{t}

indicate Gaussian distribution, the identity matrix, and noise scale at timestep t, respectively.

β_{t}

increases as timestep t increases. It can be learned from the data, but generally a predefined fixed scheduler is used such as the linear scheduler [27], or cosine scheduler [55], or sigmoid scheduler [56]. We adopted the linear scheduler represented in Equation (2), due to its empirical stability [33,38] and implementation simplicity.

β_{t} = β_{s} \cdot (1 - \frac{t - 1}{T - 1}) + β_{e} \cdot \frac{t - 1}{T - 1}

(2)

where

β_{s}

,

β_{e}

, and the maximum timestep T are hyperparameters. Following the proposal by Ho et al. [27], we set

β_{s} = 0.0001

. The optimal values for

β_{e}

and T will be evaluated in Section 5.

As shown in Equation (1), sampling

x_{t}

depends on its previous

x_{t - 1}

only, and when Gaussian noises are applied, we can assume

x_{t}

is sampled from

x_{0}

by the single noise addition represented in Equations (3) and (4) [46].

q (x_{t} | x_{0}) = N (x_{t}; \sqrt{{\bar{α}}_{t}} x_{0}, 1 - {\bar{α}}_{t} I)

(3)

{\bar{α}}_{t} = Π_{i = 1}^{t} α_{i}, α_{t} = 1 - β_{t}

(4)

Since

β_{t}

increases as t increases,

α_{t}

and

{\bar{α}}_{t}

decrease. As a result, the information in

x_{0}

decreases, but the noise increases as t increases. At the last time step T,

x_{T}

converges to pure Gaussian noise.

Backward Process: The backward process of DM is a denoising process. From

x_{T}

, the fully noised data at the final timestep, the noise is removed step by step to recover

x_{0}

. The ground-truth denoising step, which is derived by using Bayes’ rule, is represented in Equations (5)–(7).

q (x_{t - 1} | x_{t}, x_{0}) = N (x_{t - 1}; μ_{q} (x_{t}, x_{0}, t), σ_{q}^{2} (t) I)

(5)

μ_{q} (x_{t}, x_{0}, t) = \frac{\sqrt{α_{t}} (1 - {\bar{α}}_{t - 1}) x_{t} + \sqrt{{\bar{α}}_{t - 1}} (1 - α_{t}) x_{0}}{1 - {\bar{α}}_{t}}

(6)

σ_{q}^{2} (t) = \frac{(1 - {\bar{α}}_{t - 1}) (1 - α_{t})}{(1 - {\bar{α}}_{t})}

(7)

Unfortunately, we do not know the original data

x_{0}

at the inference time. Therefore, instead of using

x_{0}

, we train and adopt a

x_{0}

-predictor which predicts

x_{0}

from

x_{t}

and t by using neural networks. In addition, because we want to control DM’s denoising (generation) direction, the predictor also takes the generation condition c as an input. As a result,

x_{0}

in Equation (6) is replaced by

x_{0}

-predictor

{\hat{x}}_{θ} (x_{t}, t, c)

and generates a new equation, Equation (8). The final denoising sample follows Equation (9).

μ_{θ} (x_{t}, t, c) = \frac{\sqrt{α_{t}} (1 - {\bar{α}}_{t - 1}) x_{t} + \sqrt{{\bar{α}}_{t - 1}} (1 - α_{t}) {\hat{x}}_{θ} (x_{t}, t, c)}{1 - {\bar{α}}_{t}}

(8)

p_{θ} (x_{t - 1} | x_{t}, c) = N (x_{t - 1}; μ_{θ} (x_{t}, t, c), σ_{q}^{2} (t) I)

(9)

Training: DM training is equivalent to training

{\hat{x}}_{θ} (x_{t}, t, c)

. Because the goal of

{\hat{x}}_{θ} (x_{t}, t, c)

is to predict

x_{0}

, the objective function is designed to minimize the prediction error. We follow Ho et al.’s proposal [27], Equation (10), which not only reduces computational cost but also assigns greater importance to predictions made from noisier inputs, i.e., the more challenging cases.

L_{d} = E_{t, x_{0}, c} ‖ x_{0} - {\hat{x}}_{θ} (x_{t}, t, c) ‖

(10)

where

‖ \cdot ‖

represents the vector norm.

Algorithm 1 shows the DM training details. For a given data point

x_{0}

and its condition c, an arbitrary timestep t is uniformly sampled between 1 and T, and the gradient of the objective function is calculated for the predictor training.

Algorithm 1 DM Train

Input: Dataset X with assigned conditions to its elements, maximum timestep T
Output: Trained model parameter

θ

1:: Initialize $θ$ with random values
2:: while $θ$ is not converged do
3:: Randomly sample a batch $X^{'} \subset X$
4:: for all $(x_{0}, c) \in X^{'}$ do
5:: Sample $t \sim U (1, T)$
6:: Sample $x_{t} \sim q (x_{t} | x_{0})$ (Equation (3))
7:: Calculate loss $L_{d}$ using Equation (10)
8:: Update $θ$ using gradient descent with $\nabla_{θ} L_{d}$
9:: end for
10:: end while

Inference: Algorithm 2 shows DM’s inference procedure details. DM takes a given pure Gaussian noise and condition pair and performs denoising for each timestep to generate the data satisfying the condition. The reparameterization trick [57] is adopted for sampling.

Algorithm 2 DM Inference

Input: Condition c, maximum timestep T

1:: $x_{T} \sim N (0, I)$
2:: for t←T to 1 do
3:: $ϵ \sim N (0, I)$
4:: if t == 1 then
5:: $ϵ = 0$
6:: end if
7:: $x_{t - 1} = μ_{θ} (x_{t}, t, c) + σ_{q} (t) ϵ$ ( $μ_{θ} (\cdot)$ : Equation (8), $σ_{q} (t)$ : The square root of Equation (7))
8:: end for
9:: return $x_{0}$

3.2.1. Overview of the Proposed Method

We first describe the overview of the proposed method and then explain the details in the following subsections. Figure 1 shows the two-phase recommendation of the proposed method. In the first phase, the DM-based generator generates collaborative embeddings of cold items using their side information as input. Then, the generated collaborative embeddings of cold items and the collaborative embeddings of users calculated by the base recommender are fed into the refiner to calculate more accurate recommendations for each user. The generated embeddings contain some errors that originated from information deficiency. For instance, coarse-grained side information of the cold items is given as the generation condition, and valuable user-side information, the ground-truth collaborative embeddings of users, is not considered. We employed the refiner to correct the error in the generated collaborative embeddings by the DM-based generator.

We train the DM-based generator first and then train the refiner using the trained generator. Although we adopted the refiner, it is worth noting that we can directly use the collaborative embeddings of cold items generated by the DM-based generator to calculate recommendation scores just by applying dot products with user collaborative embeddings.

3.2.2. DM-Based Generator

The DM-based generator is a DM that generates collaborative embeddings

e_{i}

of cold items from their side information

c_{i}

as its generation condition. Therefore, in the training, we assigned

e_{i}

and

c_{i}

to the original data

x_{0}

and c in Algorithm 1, respectively.

Network architecture: For the

x_{0}

-predictor

{\hat{x}}_{θ} (x_{t}, t, c_{i})

in Equation (8), instead of simple MLPs used in previous works [33,38], we added a guideline predictor consisting of a Mixture-of-Experts network (MoE) [58] to handle complex collaborative embedding distributions more effectively and adapt to diverse patterns with limited information on items’ side information. As shown in Figure 2, side information

c_{i}

is given to the MoE to generate the first prediction of collaborative embedding

{\hat{x}}_{0}^{(1)}

; then, with the inputs of DM,

{\hat{x}}_{0}^{(1)}

is fed into the MLP as additional information to predict

x_{0}

, the collaborative embedding of cold item

e_{i}

.

The MoE network consists of the gate network represented in Equation (11) and multiple experts

e \in E

that have the same network structure described in Equation (12), but different parameters are learned from each other.

g = s i g m o i d (W_{g} \times c_{i} + b_{g})

(11)

where

g \in R^{| E |}

indicates the gate vector whose elements represent the weights assigned to individual experts’ output.

W_{g} \in R^{| E | \times d_{c o n d}}

,

b_{g} \in R^{| E |}

, and

s i g m o i d

indicate the gate network’s weight matrix, bias, and the sigmoid activation function, respectively.

{\hat{x}}_{0, e} = W_{e} \times c_{i} + b_{e}

(12)

where

W_{e} \in R^{d_{c f} \times d_{c o n d}}

and

b_{e} \in R^{d_{c f}}

represent the weight matrix and bias of individual expert e. We chose a single-layer linear transformation without an activation function as the expert network because our pre-experiments showed that the configuration mentioned above shows better performance than more complicated non-linear networks, which is also supported by other works [20,59].

The first prediction

{\hat{x}}_{0}^{(1)}

for

x_{0}

is calculated by the weighted sum of the experts’ outputs, as represented in Equation (13).

{\hat{x}}_{0}^{(1)} = t a n h (Σ_{e = 1}^{| E |} g_{i} \cdot {\hat{x}}_{0, e})

(13)

where

g_{i}

and

t a n h

represent the value of the i-th element of g and the hyperbolic tangent activation function, respectively.

With

{\hat{x}}_{0}^{(1)}

,

x_{t}

,

c_{i}

, and the time step embedding [60]

t e_{t} \in R^{d_{t i m e}}

, which is related to time step t, are concatenated and passed to the three-layer MLP to calculate the final prediction

{\hat{x}}_{0}^{(2)}

for

x_{0}

. The network structure of the MLP is represented in Equations (14)–(16).

m l p^{(1)} = s w i s h (W_{m l p}^{(1)} \times (x_{t} ‖ {\hat{x}}_{0}^{(1)} ‖ c_{i} ‖ t e_{t}) + b_{m l p}^{(1)})

(14)

where

W_{m l p}^{(1)} \in R^{d_{c f} \times (2 \times d_{c f} + d_{c o n d} + d_{t i m e})}

and

b_{m l p}^{(1)} \in R^{d_{c f}}

represent the weight matrix and bias of the first layer of the MLP.

s w i s h

indicates the swish activation function [61].

m l p^{(2)} = s w i s h (W_{m l p}^{(2)} \times m l p^{(1)} + b_{m l p}^{(2)})

(15)

{\hat{x}}_{0}^{(2)} = W_{m l p}^{(3)} \times m l p^{(2)} + b_{m l p}^{(3)}

(16)

where

W_{m l p}^{(l)} \in R^{d_{c f} \times d_{c f}}

and

b_{m l p}^{(l)} \in R^{d_{c f}}

represent the weight matrix and bias of the l-th layer of the MLP. The only difference in the network structure between the second and third layers is the existence of the activation function. We attached a linear transformation layer as the third layer because the value of collaborative embeddings does not indicate the probability, and we do not have concrete prior knowledge about its range.

Training: The quality of

{\hat{x}}_{0}^{(1)}

is important because the MLP that calculates the final prediction exploits the prediction of the MoE as its prediction guide. To improve the prediction accuracy of

{\hat{x}}_{0}^{(1)}

, we adopted the objective function in Equation (17) instead of Equation (10) in Algorithm 1.

L_{d} = E_{t, x_{0}, c} (‖ x_{0} - {\hat{x}}_{0}^{(1)} ‖_{1} + ‖ x_{0} - {\hat{x}}_{0}^{(2)} ‖_{1})

(17)

where

{‖ \cdot ‖}_{1}

indicates the L1-norm. Instead of MSE, we adopted MAE because the scale of the element of the embedding vector that we want to predict is small, and MAE forces the model to reduce small prediction errors. We give the fixed weight to each error in Equation (17) to keep the model simple because our pre-experiments achieved the best performance with the configuration.

Inference: We can generate the collaborative embedding of the cold item by calculating

x_{0}

in Algorithm 2 with

c_{i}

as the input condition c.

3.3. Second-Phase Refiner

The refiner’s purpose in the second phase is to reduce the error that may be caused by the lack of user-side information or coarse side information of cold items by adjusting the ground-truth user collaborative embeddings and the generated item collaborative embeddings to predict more accurate user–item interactions. Figure 3 shows the architecture of the refiner (Refiner). The ground-truth user collaborative embeddings

e_{u}

, generated item collaborative embeddings

{\hat{e}}_{i}

, and items’ side information

c_{i}

are given. With the information provided, the Refiner calculates the recommendation score

{\hat{r}}_{u i}

which is proportional to the probability of interaction between user u and item i by calculating Equation (18) with the refined user vector

{\hat{e}}_{u}^{+}

and item vector

{\hat{e}}_{i}^{+}

. The Refiner adopts two sub-networks: one for users and one for items. The user-side sub-network that calculates

{\hat{e}}_{u}^{+}

is represented in Equation (19).

{\hat{r}}_{u i} = {\hat{e}}_{u}^{+} \cdot {\hat{e}}_{i}^{+}

(18)

{\hat{e}}_{u}^{+} = W_{U}^{(2)} \times B N (t a n h (W_{U}^{(1)} \times e_{u} + b_{U}^{(1)})) + b_{U}^{(2)}

(19)

where

W_{U}^{(1)}, W_{U}^{(2)} \in R^{d_{c f} \times d_{c f}}

, and

b_{U}^{(1)}, b_{U}^{(2)} \in R^{d_{c f}}

represent weights and biases of the first and second layer, respectively.

B N

indicates the batch normalization [62].

The item-side sub-network comprises the two layers represented in Equations (20) and (21).

h_{i} = B N (t a n h (W_{I}^{(1)} \times ({\tilde{e}}_{i} ‖ c_{i}) + b_{I}^{(1)}))

(20)

{\hat{e}}_{i}^{+} = W_{I}^{(2)} \times (h_{i} ‖ {\tilde{e}}_{i} ‖ c_{i}) + b_{I}^{(2)}

(21)

where

W_{I}^{(1)} \in R^{d_{c f} \times (d_{c f} + d_{c o n d})}

and

W_{I}^{(2)} \in R^{d_{c f} \times (2 d_{c f} + d_{c o n d})}

, and

b_{I}^{(1)}, b_{I}^{(2)} \in R^{d_{c f}}

represent weights and biases of the first and second layers, respectively. At the inference time,

{\tilde{e}}_{i}

indicates the predicted collaborative embedding of the cold item calculated by the DM-based generator. We designed each of the two layers as a collaborative embedding refiner, which refines

{\tilde{e}}_{i}

based on the side information

c_{i}

and the output of the previous layer as guidelines.

Training: To train the Refiner, we use the interaction information

R^{t r}

that is used to train the base recommender and the ground-truth item collaborative embeddings

e_{i}

calculated by the base recommender. For each positive interaction of user u whose value of

r_{u i} \in R^{t r}

is one, five negative user–item interactions whose value of

r_{u i} \in R^{t r}

is zero are sampled. The sets of positive and negative interactions are joined to create the training dataset

D^{t r}

. With

D^{t r}

, the Refiner is trained to minimize the objective function represented in Equation (22). The details of the training procedure are found in Algorithm 3.

L_{R e f i n e r} = \frac{1}{| D_{t r} |} Σ_{r_{u i} \in D_{t r}} {‖ r_{u i} - {\hat{r}}_{u i} ‖}_{2}^{2}

(22)

Algorithm 3 Refiner Train

Input: Dataset

D_{t r}

, the sets of ground-truth user, item, generated item collaborative embeddings, and items’ side information:

E_{U}

,

E_{I}

,

{\hat{E}}_{I}

, and

C_{I}

Output: Trained model parameter

θ

1:: Initialize $θ$ with random values
2:: while $θ$ is not converged do
3:: Randomly sample a batch $D_{t r}^{'} \subset D_{t r}$
4:: for all $r_{u i} \in D_{t r}^{'}$ do
5:: Extract $e_{u}$ , $e_{i}$ , ${\hat{e}}_{i}$ , and $c_{i}$ from $E_{U}$ , $E_{I}$ , ${\hat{E}}_{I}$ , and $C_{I}$ , respectively
6:: Calculate ${\hat{e}}_{u}^{+}$ and ${\hat{e}}_{i}^{+}$ by using Equations (19)–(21).
7:: Calculate ${\hat{r}}_{u i}$ by using Equation (18) with ${\hat{e}}_{u}^{+}$ and ${\hat{e}}_{i}^{+}$
8:: Calculate loss $L_{R e f i n e r}$ by using Equation (22)
9:: Update $θ$ using gradient descent with $\nabla_{θ} L_{R e f i n e r}$
10:: end for
11:: end while

When we calculate

{\hat{e}}_{i}^{+}

, following the ideas of Wang et al. [20] and Volkovs et al. [19], input

{\tilde{e}}_{i}

for the item-side sub-network represented in Equations (20) and (21) is sampled by Equation (23). By allowing the network access to the ground-truth and generated embeddings alternately for the given side information, we expect the model to more accurately guess the ground-truth embeddings based on the generated embedding under the given side information.

{\tilde{e}}_{i} = \{\begin{matrix} {\hat{e}}_{i}, & with probability 0.5 \\ e_{i}, & otherwise \end{matrix}

(23)

Inference: The Refiner calculates refined user embeddings

{\hat{e}}_{u}^{+}

from users’ ground-truth collaborative embeddings. Similarly, refined item embeddings

{\hat{e}}_{i}^{+}

are calculated with the generated collaborative embeddings and side information of cold items as input. Once calculated, we can store the two kinds of embeddings in the system. For an arbitrary user u and cold item i pair, we can calculate

{\hat{r}}_{u i}

by using Equation (18) with

{\hat{e}}_{u}^{+}

and

{\hat{e}}_{i}^{+}

. The ordered recommendation list of k-items for user u,

L_{u}^{k}

, is calculated by sorting the top k items that have the highest

{\hat{r}}_{u i}

scores in descending order.

4. Evaluation Environment

We describe the evaluation environment, which includes the datasets used, evaluation metrics, and the baseline methods compared in the evaluation.

4.1. Datasets

We evaluated the proposed method with two real-world datasets, CiteULike and MovieLens25M.

CiteULike [63] is a well-known dataset used for research in recommender systems. The data were collected from CiteULike, an online reference management and social bookmarking platform where users can save and share academic papers. The dataset includes about 0.2 million user–save–article interactions. As side information, abstracts of the articles are provided. We followed [19,20,63] to calculate the embeddings for side information. Specifically, we first selected the top 8000 high-scored words from the words that appeared in the abstract corpus by applying tf-idf and constructed an 8000-dimensional multi-hot feature vector for each paper. After that, the multi-hot vectors were deduced to 300-dimensional embedding vectors by applying SVD [64]. We randomly selected 1018 and 2378 non-interacted items as the validation and test item sets, and interactions related to the two item sets were used as validation and test sets, respectively. The remainder of the interactions were used as the training set.

MovieLens25M (ML25M) [65] is a dataset for movie recommendation research, released by the Grouplens (https://grouplens.org/ (accessed on 30 March 2025)). It contains 25 million user ratings and movie metadata and is one of the largest publicly available datasets. The dataset includes side information such as titles, genres, release years, and additional tagging information. We calculated the 400-dimensional side information embedding from the tag genome data [66] of each movie by applying VAE [52]. For train–validation–test data separation, we made five non-intersected movie groups of the same size. We selected one group and used 70% of its movies as the test movie set and the remaining 30% as the validation movie set. The last four movie groups were used as the train movie set. The interaction data associated with the three movie sets were used as the test, validation, and training sets. From all sets, ratings of less than 3.0 in 5-scale ratings were filtered out as negative ratings. We performed 5-fold cross-validation by selecting another group for the test and validation groups.

Collaborative embedding of base CF method: We used BPR [67], one of the most popular CF algorithms in academia and industry, as the base CF method to compute the collaborative embeddings. We calculated the collaborative embeddings of dimensions 200 and 128 using the training dataset for CiteULike and ML25M, respectively. Table 2 represents the statistics of the datasets.

Quality of side information embeddings: We used SVD and VAE to obtain compact embeddings for words and tag distributions, respectively. Both methods are optimized to minimize the reconstruction error of the original word or tag distribution of an item from the embedding. To investigate whether severe compression could result in information loss, we conducted experiments by varying the embedding dimensions. We observed that increasing the dimensionality beyond the values reported in the description of CiteULike and ML25M did not lead to significant improvements in reconstruction performance. This suggests that the chosen dimensions are reasonable balances between compression and information preservation.

4.2. Evaluation Metrics

We assessed the recommendation performance by employing Precision@k (P@k) [68], Recall@k (R@k) [68], and NDCG@k (N@k) [69].

P@k represents the fraction of relevant items (the items that the user has interacted with) included in the top k positions of the recommendation list. It is computed using Equation (24).

P @ k = \frac{1}{| U |} \sum_{u \in U} \frac{| L_{u}^{k} \cap I_{u}^{+} |}{| L_{u}^{k} |}

(24)

R@k measures the proportion of the interacted items within the top k recommendations relative to the total number of interacted items in the test dataset. It is determined by Equation (25).

R @ k = \frac{1}{| U |} \sum_{u \in U} \frac{| L_{u}^{k} \cap I_{u}^{+} |}{| I_{u}^{+} |}

(25)

Here,

I_{u}^{+}

refers to the set of items that user u has interacted with, while

L_{u}^{k}

denotes the set of recommended items that appeared within the top k positions of the recommendation list for u.

N@k considers the interacted items’ relevance and ranking position within the top k recommendations. Even if the number of relevant items in the top-k list remains the same, N@k assigns higher scores to items appearing in earlier ranks. It is computed using Equation (26).

N @ k = \frac{D C G @ k}{I D C G @ k}

(26)

D C G @ k = \frac{1}{| U |} \sum_{u \in U} \sum_{l = 1}^{k} \frac{1_{I_{u}^{+}} (l)}{m a x (1, l o g_{2} (l))}

(27)

Here,

1_{I_{u}^{+}} (l)

in Equation (27) represents the indicator function that identifies whether an item appeared in the l-th position in

L_{u}^{k}

is included in

I_{u}^{+}

or not. It takes a value of one if the item is included in

I_{u}^{+}

, and zero otherwise.

I D C G @ k

(Ideal

D C G @ k

) corresponds to the maximum possible

D C G @ k

where all relevant items (

i \in I_{u}^{+}

) are ranked consecutively from the top of the recommendation list.

4.3. Comparison Methods

We conducted a comparison between the proposed method and five existing related approaches.

DropoutNet [19]: This approach is designed to predict recommendation scores obtained from the base recommender model by randomly masking collaborative embeddings in its input pair’s CF embedding and side information embedding during training. By doing so, the model is trained to infer missing collaborative embeddings based solely on the item’s side information, enabling it to generalize to cold items more effectively.

Heater [20]: Heater aims to directly predict collaborative embeddings by optimizing an objective function that enforces the model to learn collaborative embeddings from cold items’ side information. The recommendation score is then derived from the predicted collaborative embeddings. To enhance prediction accuracy, Heater employs a Mixture-of-Experts predictor and CF embedding dropout strategies.

CVAR [15]: This method utilizes two encoders and a decoder within a variational autoencoder (VAE) framework [52] to transform the item’s side information into collaborative embeddings. The first encoder processes pre-trained collaborative embeddings from a base CF model while the decoder reconstructs the collaborative embeddings from the encoded representations. Meanwhile, the second encoder learns representations from the item’s side information. The objective function encourages the encoding from the second encoder to closely match that of the first encoder for the same item, ensuring that the decoder can generate valid collaborative embeddings, thus improving cold item recommendations.

CCFC [22]: CCFC leverages contrastive learning to enhance CF embedding prediction. This method enforces similar items (e.g., those with overlapping user interaction histories) to have closely aligned collaborative embeddings while ensuring that the learned embeddings remain distinct from those of unrelated items. Through this contrastive learning strategy, CCFC refines the quality of side information-driven CF embedding predictions.

GME [53]: It enhances the cold item’s side information representation by aggregating information from its neighbors with similar side information using a graph attention network. The score predictor then leverages these enriched side information representations to estimate recommendation scores effectively.

DMCIR: DMCIR (Diffusion Model-based Cold Item Recommender) indicates the DM-based generator of the proposed method to generate the collaborative embeddings from cold items’ side information. Since it generates the collaborative embeddings of the item side, it can recommend items to users by calculating dot product between the generated collaborative embeddings of candidate items and the collaborative embeddings of target users.

DMCIR+: DMCIR+ represents the results that applied the Refiner of the proposed method to the collaborative embeddings generated by DMCIR.

The methods were implemented and evaluated using TensorFlow 2.15 (https://www.tensorflow.org/ (accessed on 30 March 2025)) on a machine equipped with a Ryzen Threadripper Pro 3945WX 12-core CPU (AMD), 128 GB of RAM (Samsung, Suwon Republic of Korea), and a TITAN RTX GPU (nVidia, Santa Clara, CA, USA). The source code for the proposed method is publicly available at https://github.com/JungkyuHan/DMCIR (accessed on 30 March 2025). For the implementation of comparison methods, when the official source codes were available [15,19,20,22], we referred to them. Additionally, we adopted the hyperparameter settings that yielded the best validation performance. We evaluated the models with different initial model parameters five times and reported the average performance. DMCIR/DMCIR+ used the hyperparameters shown in Table 3.

5. Result Analysis

We conducted the experiments to address the following three research questions:

RQ1. How well does the diffusion model perform in cold item recommendations? How much can its performance be improved by incorporating the post-recommender model (the Refiner)?
RQ2. What are the optimal hyperparameter values for the DM-based generator in collaborative embedding prediction?
RQ3. To what extent does the quality of embeddings influence model performance?

5.1. Overall Performance (RQ1)

Table 4 shows the overall performance. DMCIR, the recommendation that directly used the collaborative embeddings generated by the DM-based generator, shows decent performance. Only Heater in ML25M and CCFC in CiteULike outperform or match DMCIR. The observation highlights the diffusion model’s superior capability in generating high-quality outputs. DMCIR+, recommended with the Refiner-generated collaborative embeddings from DMCIR’s results, achieved the best performance among all other comparison methods over all datasets. This observation suggests that certain collaborative information is challenging to infer solely from the items’ side information and that some can be effectively captured through a refinement process incorporating user-side collaborative embeddings.

Among the comparison methods, those that incorporate explicit collaborative embedding training guidance into their objective functions—namely Heater and CCFC—consistently outperform the others in most cases. This observation indicates that providing predictive guidance on collaborative embeddings during training effectively enhances the performance of cold item recommendation. DMCIR+ leverages the output of DMCIR as prediction guidance and achieves the best performance.

The elapsed time for cold item recommendation is shown in Table 5. It demonstrates how quickly the system can incorporate new items into the top recommendations for relevant users. “Item Gen” refers to the time required by the DM-based generator to generate item embeddings. “Item Refine” and “User Refine” indicate the time taken by the Refiner to refine item and user embeddings, respectively. As shown, generating about 17,000 item embeddings takes approximately 40 seconds, and the Refiner completes both refinement steps in less than one second. The calculation of recommendation item lists (“Rec” column) for ML25M (162,540 users and 13,816 candidate items) is also completed in under one second. Since the generation and refinement steps are performed only once per item, the “Rec” process dominates the actual recommendation cost. We argue that achieving recommendations for approximately 160,000 users in under one second is fast enough for deployment in commercial recommender systems.

5.2. DM’s Hyperparameter Effect (RQ2)

DM’s Noise Schedule: Here, we evaluate the performance of DMCIR with various values of

β_{e}

in Equation (2). Figure 4 plots the decay ratio of

x_{0}

in the forward process to timestep t (

\sqrt{\bar{α_{t}}}

in Equation (3)) for different

β_{e}

values. As the value of

β_{e}

increases, the speed of the original signal decay also increases. Table 6 plots the performance of DMCIR with

β_{e} \in {0.02, 0.03, 0.04, 0.05, 0.06}

. The other hyperparameters, the number of experts

| E |

and the maximum timestep T are fixed to 5 and 400, respectively. The average performance increases as

β_{e}

increases to 0.05. The performance starts to decrease at

β_{e} = 0.06

.

This trend indicates that the original information available in the last part of the forward process affects the reconstruction accuracy. As shown

β_{e} = 0.02

in Figure 4, when

β_{e}

is small, the signal-to-noise ratio (SNR) of the original information does not become zero at T. As Lin et al. [70] pointed out, the remaining original information in the training phase makes the denoiser (predictor) try to seek the original information at

x_{T}

, which is regarded as a pure Gaussian noise. Because we do not know collaborative embeddings of cold items, we set a pure Gaussian noise as

x_{T}

is in the inference phase. The performance drop at small

β_{e} = 0.02

is caused because the denoiser considers a noise as information from

x_{0}

. The phenomenon is mitigated as

β_{e}

increases, but when

β_{e}

is greater than 0.05, the available information of

x_{0}

in the last part of the forward process becomes too small. That also hinders the performance of the denoiser.

It is important to determine the optimal value of

β_{e}

. Based on our experimental result showing that

β_{e} \in [0.04, 0.05]

yields consistently fair performance across datasets, we suggest 0.04 to 0.05 as a reasonable choice.

Maximum Timesteps: For DM, an optimal value of the maximum timesteps T is critical. When the value of T is too small, although the training and inference costs are reduced, the noise scale applied to each timestep increases. That makes the denoising process difficult. On the other hand, when the value of T is too large, the denoising process becomes easier, resulting in increased computational costs. Figure 5 shows DMCIR’s performance with the CiteULike ((a)) and ML25M ((b)) datasets to various

T \in {100, 200, 300, 400, 500, 600}

. We fixed the other parameters as

β_{e} = 0.05

and

| E | = 5

.

Generally, the recommendation performance increases as the value of T increases and reaches its peak at 300 and 400. After that, the performance starts to decline. This observation is caused by the extremely small noise size of the single timestep. The extremely small size of the noise is also difficult to predict accurately. Considering the performance and cost trade-off, the value of T as 300 or 400 is a good choice.

The Number of Experts in $x_{0}$ -Predictor of DM: The

x_{0}

-predictor of DMCIR adopts a Mixture-Of-Experts architecture to generate the first prediction

{\hat{x}}_{0}^{(1)}

in Equation (13), which is used for prediction guidance of DM. In Figure 6, we show the recommendation performance of DMCIR with CiteULike ((a)) and ML25M ((b)) datasets to various number of experts

| E | \in {0, 1, 3, 5, 7}

. “

| E | = 0

” indicates that MoE in Figure 2 is removed. The other parameters were fixed as

β_{e} = 0.05

and

T = 400

.

In all datasets, the worst performance was observed when “

| E | = 0

”, indicating that MoE contributes to the recommendation performance. The performance increases as

| E |

increases to some point,

| E | = 3

for CiteULike and

| E | = 5

for ML25M, and then starts to decline in both datasets but the details are different. The performance gain caused by the MoE with the CiteULike dataset is smaller than that with ML25M, which shows significant improvement. The observation was caused by the characteristics of the datasets. The users and side information of items in the CiteULike dataset are specialists and abstracts of papers. In contrast, the users and side information of items in the ML25M dataset are ordinary people and tags assigned to movies. Therefore, the uncertainty in the relationship between user interactions and side information is lower in the CiteULike dataset, which allows the performance of the

x_{0}

-predictor without MoE to approach that of the MoE-enhanced

x_{0}

-predictor.

5.3. Embedding Quality (RQ3)

Generated embedding quality to the Refiner’s performance: The Refiner takes the collaborative embeddings calculated by DM-based generator to calculate refined collaborative embeddings for the final recommendation. Therefore, the relationship analysis between input and the final recommendation quality is important. To this end, we compared the performance of the DMCIR model with two different configurations and the performance of the two DMCIR+ models whose input collaborative embeddings are the output of each DMCIR model. Table 7 shows the result.

{DMCIR}_{0}

and

{DMCIR}_{o p t}

indicate that DMCIR with the number of experts is zero and with the best number of experts (three for CiteULike, five for ML25M), respectively.

{DMCIR}_{0}

+ and

{DMCIR}_{o p t}

+ use the outputs of

{DMCIR}_{0}

and

{DMCIR}_{o p t}

as inputs, respectively.

The performance of DMCIR+ is proportional to its input item collaborative embeddings. This indicates that the qualitative first prediction of item collaborative embeddings is essential to calculate more accurate refined embeddings. An interesting finding is that while the performance difference between

{DMCIR}_{0}

and

{DMCIR}_{o p t}

and the difference between

{DMCIR}_{0}

+ and

{DMCIR}_{o p t}

+ in CiteULike are similar, in ML25M the difference between

{DMCIR}_{0}

+ and

{DMCIR}_{o p t}

+ is much smaller compared to the difference between

{DMCIR}_{0}

and

{DMCIR}_{o p t}

. That indicates that, in the ML25M dataset, some information captured by the MoE architecture in DMCIR from item collaborative embeddings can be compensated by the refining process, which is re-considering user embeddings and user–item interactions. Thus, further research is required to extract and integrate information using integrated architecture. However, calculating high-quality collaborative embeddings upstream is still important to achieving top recommendation accuracy, especially at the top part of the recommendation list that users most frequently examine.

Side information quality to model performance: Since the proposed method relies on side information as a pivotal input, the quality of the side information is crucial. To investigate the relationship between the quality of side information and model performance, we evaluated our method using side information embeddings of varying quality. Specifically, we calculated degraded side information by randomly dropping p % of the original content prior to embedding calculation. Figure 7 shows the relative R@20 scores calculated with side information of different qualities. Across all datasets and model types, performance consistently degrades with lower-quality side information, demonstrating that side information quality is a critical factor in the effectiveness of the proposed method.

Embedding Transformation Analysis: Here, we analyze how the Refiner transforms its input collaborative embeddings—specifically, the ground-truth user embeddings and the generated item embeddings. Since the Refiner projects these inputs into a different space, it is not straightforward to directly compare the input and output embeddings of the Refiner. To address this, we evaluate the transformation by examining the relative angles between embeddings. We compute the pairwise cosine similarity and construct similarity matrices for each set of vectors. For each dataset, we obtain four matrices:

S_{U}^{i n}

,

S_{I}^{i n}

,

S_{U}^{r e}

,

S_{I}^{r e}

, representing the similarity matrices for users’ ground-truth embeddings, items’ generated embeddings, refined user embeddings, and refined item embeddings, respectively.

Figure 8 shows the value distribution of the elements in the absolute difference matrix

D_{x} = | S_{x}^{i n} - S_{x}^{r e} |

. Across datasets, the relative angles between user embeddings remain mostly unchanged. However, for item embeddings, the CiteULike dataset shows only minor changes, whereas the ML25M dataset exhibits more significant transformations. This observation is further illustrated in Figure 9, which displays side-by-side visualizations of item embeddings before and after refinement. The visualized distributions of embeddings in CiteULike are similar to each other, whereas in ML25M, they show more noticeable differences. We think that this difference is caused by the varying quality of side information. Compared to the abstracts of academic papers used in CiteULike, the tags associated with full-length movies in ML25M are less effective in representing item content. As a result, Refiner performs more substantial transformations on the embeddings generated from the less informative movie tags.

6. Conclusions

This paper presented a novel diffusion model framework as an add-on solution for improving cold item recommendation performance in collaborative filtering (CF) systems. Our diffusion model-based generator generates collaborative embeddings for cold items by effectively leveraging items’ side information. The second-phase refiner further enhances recommendation accuracy by utilizing simple neural network components and dropout regularization, ensuring effective refinement of embeddings with minimal system overhead.

Through extensive experiments on real-world datasets, we demonstrated that our approach outperforms state-of-the-art methods in cold item scenarios, particularly at the start of item introduction. The model’s performance and insights from hyperparameter sensitivity and ablation studies highlight the practical value and flexibility of integrating diffusion models into recommender systems. Our findings suggest that the diffusion models are a promising direction for addressing long-standing challenges in cold item recommendation.

Future work will explore four main directions: (1) Developing more effective predictor architectures for the diffusion model to enhance the accuracy of generated collaborative embeddings. (2) Designing an end-to-end architecture that tightly integrates the diffusion generation module and the refiner module to improve overall efficiency and performance. (3) Conducting a thorough study for a more comprehensive analysis of the noise scheduler’s stability in the diffusion model in the context of cold item recommendations. (4) Investigating structured comparison techniques to improve the reliability of item representations in cold-start recommendation settings.

Author Contributions

Conceptualization, J.H.; methodology, J.H. and S.C.; software, J.H.; validation, J.H. and S.C.; data curation, J.H. and S.C.; writing—original draft preparation, J.H.; writing—review and editing, J.H. and S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Dong-A University research fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/JungkyuHan/DMCIR (accessed on 30 March 2025).

Acknowledgments

This work was supported by the Dong-A University research fund.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CF	Collaborative Filter
DM	Diffusion Model

References

Su, X.; Khoshgoftaar, T.M. A Survey of Collaborative Filtering Techniques. Adv. Artif. Intell. 2009, 2009. [Google Scholar] [CrossRef]
Lam, X.N.; Vu, T.; Le, T.D.; Duong, A.D. Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, Suwon, Republic of Korea, 31 January–1 February 2008; pp. 208–211. [Google Scholar] [CrossRef]
Adomavicius, G.; Tuzhilin, A. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 2005, 17, 734–749. [Google Scholar] [CrossRef]
Yu, K.; Schwaighofer, A.; Tresp, V.; Xu, X.; Kriegel, H.P. Probabilistic memory-based collaborative filtering. IEEE Trans. Knowl. Data Eng. 2004, 16, 56–69. [Google Scholar] [CrossRef]
Fallah, A.; Mokhtari, A.; Ozdaglar, A. Generalization of Model-Agnostic Meta-Learning Algorithms: Recurring and Unseen Tasks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 5469–5480. [Google Scholar]
Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning—Volume 70, Sydney, Australia, 6–11 August 2017; JMLR.org, ICML’17. pp. 1126–1135. [Google Scholar]
Lee, H.; Im, J.; Jang, S.; Cho, H.; Chung, S. MeLU: Meta-Learned User Preference Estimator for Cold-Start Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; KDD’19; pp. 1073–1082. [Google Scholar] [CrossRef]
Dong, M.; Yuan, F.; Yao, L.; Xu, X.; Zhu, L. MAMO: Memory-Augmented Meta-Optimization for Cold-Start Recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 23–27 August 2020; KDD’20. pp. 688–697. [Google Scholar] [CrossRef]
Lin, X.; Wu, J.; Zhou, C.; Pan, S.; Cao, Y.; Wang, B. Task-Adaptive Neural Process for User Cold-Start Recommendation. In Proceedings of the Web Conference 2021, Virtual, 19–23 April 2021; WWW’21. pp. 1306–1316. [Google Scholar] [CrossRef]
Pan, F.; Li, S.; Ao, X.; Tang, P.; He, Q. Warm Up Cold-Start Advertisements: Improving CTR Predictions via Learning to Learn ID Embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; SIGIR’19. pp. 695–704. [Google Scholar] [CrossRef]
Lu, Y.; Fang, Y.; Shi, C. Meta-Learning on Heterogeneous Information Networks for Cold-Start Recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 23–27 August 2020; KDD’20. pp. 1563–1573. [Google Scholar] [CrossRef]
Zhu, Y.; Xie, R.; Zhuang, F.; Ge, K.; Sun, Y.; Zhang, X.; Lin, L.; Cao, J. Learning to Warm Up Cold Item Embeddings for Cold-Start Recommendation with Meta Scaling and Shifting Networks. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; SIGIR’21. pp. 1167–1176. [Google Scholar] [CrossRef]
Salha-Galvan, G.; Hennequin, R.; Chapus, B.; Tran, V.A.; Vazirgiannis, M. Cold Start Similar Artists Ranking with Gravity-Inspired Graph Autoencoders. In Proceedings of the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September–1 October 2021; RecSys’21. pp. 443–452. [Google Scholar] [CrossRef]
Li, J.; Jing, M.; Lu, K.; Zhu, L.; Yang, Y.; Huang, Z. From Zero-Shot Learning to Cold-Start Recommendation. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. AAAI’19. [Google Scholar] [CrossRef]
Zhao, X.; Ren, Y.; Du, Y.; Zhang, S.; Wang, N. Improving Item Cold-Start Recommendation via Model-Agnostic Conditional Variational Autoencoder. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; SIGIR’22. pp. 2595–2600. [Google Scholar] [CrossRef]
Hansen, C.; Hansen, C.; Simonsen, J.G.; Alstrup, S.; Lioma, C. Content-Aware Neural Hashing for Cold-Start Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; SIGIR’20. pp. 971–980. [Google Scholar] [CrossRef]
Oramas, S.; Nieto, O.; Sordo, M.; Serra, X. A Deep Multimodal Approach for Cold-Start Music Recommendation. In Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems, Como, Italy, 27 August 2017; DLRS 2017. pp. 32–37. [Google Scholar] [CrossRef]
Raziperchikolaei, R.; Liang, G.; Chung, Y.j. Shared Neural Item Representations for Completely Cold Start Problem. In Proceedings of the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September–1 October 2021; RecSys’21. pp. 422–431. [Google Scholar] [CrossRef]
Volkovs, M.; Yu, G.; Poutanen, T. DropoutNet: Addressing Cold Start in Recommender Systems. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Zhu, Z.; Sefati, S.; Saadatpanah, P.; Caverlee, J. Recommendation for New Users and New Items via Randomized Training and Mixture-of-Experts Transformation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 1121–1130. [Google Scholar] [CrossRef]
Wei, Y.; Wang, X.; Li, Q.; Nie, L.; Li, Y.; Li, X.; Chua, T.S. Contrastive Learning for Cold-Start Recommendation. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; MM’21. pp. 5382–5390. [Google Scholar] [CrossRef]
Zhou, Z.; Zhang, L.; Yang, N. Contrastive Collaborative Filtering for Cold-Start Item Recommendation. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; WWW’23. pp. 928–937. [Google Scholar] [CrossRef]
Pazzani, M.J.; Billsus, D. Content-Based Recommendation Systems. In The Adaptive Web: Methods and Strategies of Web Personalization; Springer: New York, NY, USA, 2007; pp. 325–341. [Google Scholar] [CrossRef]
Bazzi, A.; Slock, D.T.M.; Meilhac, L. A Newton-type Forward Backward Greedy method for multi-snapshot compressed sensing. In Proceedings of the 2017 51st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 29 October–1 November 2017; pp. 1178–1182. [Google Scholar] [CrossRef]
Kakiashvili, T.; Koczkodaj, W.W.; Woodbury-Smith, M. Improving the medical scale predictability by the pairwise comparisons method: Evidence from a clinical data study. Comput. Methods Programs Biomed. 2012, 105, 210–216. [Google Scholar] [CrossRef] [PubMed]
Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on International Conference on Machine Learning—Volume 37, Lille, France, 6–11 July 2015; ICML’15. pp. 2256–2265. [Google Scholar]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Virtual, 6–12 December 2020. NIPS’20. [Google Scholar]
Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 8821–8831. [Google Scholar]
Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion Models in Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef] [PubMed]
Lovelace, J.; Kishore, V.; Wan, C.; Shekhtman, E.; Weinberger, K.Q. Latent diffusion for language generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. NIPS’23. [Google Scholar]
He, H.; Zhang, J.; Chen, H.; Chen, X.; Li, Z.; Chen, X.; Wang, Y.; Wang, C.; Xie, L. A diffusion-based framework for multi-class anomaly detection. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; AAAI Press: Washington, DC, USA, 2024. AAAI’24/IAAI’24/EAAI’24. [Google Scholar] [CrossRef]
Huang, L.; Zhang, H.; Xu, T.; Wong, K.C. MDM: Molecular diffusion model for 3D molecule generation. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. AAAI’23/IAAI’23/EAAI’23. [Google Scholar] [CrossRef]
Wang, W.; Xu, Y.; Feng, F.; Lin, X.; He, X.; Chua, T.S. Diffusion Recommender Model. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; SIGIR’23. pp. 832–841. [Google Scholar] [CrossRef]
Zhu, Y.; Wang, C.; Zhang, Q.; Xiong, H. Graph Signal Diffusion Model for Collaborative Filtering. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; SIGIR’24. pp. 1380–1390. [Google Scholar] [CrossRef]
Hou, Y.; Park, J.D.; Shin, W.Y. Collaborative Filtering Based on Diffusion Models: Unveiling the Potential of High-Order Connectivity. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; SIGIR’24. pp. 1360–1369. [Google Scholar] [CrossRef]
Walker, J.; Zhong, T.; Zhang, F.; Gao, Q.; Zhou, F. Recommendation via Collaborative Diffusion Generative Model. In Proceedings of the Knowledge Science, Engineering and Management, Singapore, 6–8 August 2022; pp. 593–605. [Google Scholar]
Choi, J.; Hong, S.; Park, N.; Cho, S.B. Blurring-Sharpening Process Models for Collaborative Filtering. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; SIGIR’23. pp. 1096–1106. [Google Scholar] [CrossRef]
Zhao, J.; Wenjie, W.; Xu, Y.; Sun, T.; Feng, F.; Chua, T.S. Denoising Diffusion Recommender Model. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; SIGIR’24. pp. 1370–1379. [Google Scholar] [CrossRef]
Ma, H.; Yang, Y.; Meng, L.; Xie, R.; Meng, X. Multimodal Conditioned Diffusion Model for Recommendation. In Proceedings of the Companion Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; WWW’24. pp. 1733–1740. [Google Scholar] [CrossRef]
Jiang, Y.; Xia, L.; Wei, W.; Luo, D.; Lin, K.; Huang, C. DiffMM: Multi-Modal Diffusion Model for Recommendation. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; MM’24. pp. 7591–7599. [Google Scholar] [CrossRef]
Ma, H.; Xie, R.; Meng, L.; Chen, X.; Zhang, X.; Lin, L.; Kang, Z. Plug-in diffusion model for sequential recommendation. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. AAAI’24/IAAI’24/EAAI’24. [Google Scholar] [CrossRef]
Yang, Z.; Wu, J.; Wang, Z.; Yuan, Y.; Wang, X.; He, X. Generate what you prefer: Reshaping sequential recommendation via guided diffusion. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. NIPS’23. [Google Scholar]
Niu, Y.; Xing, X.; Jia, Z.; Liu, R.; Xin, M.; Cui, J. Diffusion Recommendation with Implicit Sequence Influence. In Proceedings of the Companion Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; WWW’24. pp. 1719–1725. [Google Scholar] [CrossRef]
Wang, Y.; Liu, Z.; Wang, Y.; Zhao, X.; Chen, B.; Guo, H.; Tang, R. Diff-MSR: A Diffusion Model Enhanced Paradigm for Cold-Start Multi-Scenario Recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Mérida, Mexico, 4–8 March 2024; WSDM’24. pp. 779–787. [Google Scholar] [CrossRef]
Wei, T.R.; Fang, Y. Diffusion Models in Recommendation Systems: A Survey. arXiv 2025. [Google Scholar] [CrossRef]
Luo, C. Understanding Diffusion Models: A Unified Perspective. arXiv 2022, arXiv:2208.11970. [Google Scholar]
Pan, X.; Chen, Y.; Tian, C.; Lin, Z.; Wang, J.; Hu, H.; Zhao, W.X. Multimodal Meta-Learning for Cold-Start Sequential Recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; CIKM’22. pp. 3421–3430. [Google Scholar] [CrossRef]
Sun, Y.; Han, J. Mining Heterogeneous Information Networks: A Structural Analysis Approach. SIGKDD Explor. Newsl. 2013, 14, 20–28. [Google Scholar] [CrossRef]
Fu, T.y.; Lee, W.C.; Lei, Z. HIN2Vec: Explore Meta-Paths in Heterogeneous Information Networks for Representation Learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; CIKM’17. pp. 1797–1806. [Google Scholar] [CrossRef]
Wang, S.; Zhang, K.; Wu, L.; Ma, H.; Hong, R.; Wang, M. Privileged Graph Distillation for Cold Start Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; SIGIR’21. pp. 1187–1196. [Google Scholar] [CrossRef]
Togashi, R.; Otani, M.; Satoh, S. Alleviating Cold-Start Problems in Recommendation through Pseudo-Labelling over Knowledge Graph. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual, 8–12 March 2021; WSDM’21. pp. 931–939. [Google Scholar] [CrossRef]
Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends® Mach. Learn. 2019, 12, 307–392. [Google Scholar] [CrossRef]
Ouyang, W.; Zhang, X.; Ren, S.; Li, L.; Zhang, K.; Luo, J.; Liu, Z.; Du, Y. Learning Graph Meta Embeddings for Cold-Start Ads in Click-Through Rate Prediction. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; SIGIR’21. pp. 1157–1166. [Google Scholar] [CrossRef]
Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Virtual, 6–12 December 2020. NIPS’20. [Google Scholar]
Nichol, A.; Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. arXiv 2021, arXiv:2102.09672. [Google Scholar]
Jabri, A.; Fleet, D.J.; Chen, T. Scalable adaptive computation for iterative generation. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. ICML’23. [Google Scholar]
Kingma, D.P.; Salimans, T.; Welling, M. Variational dropout and the local reparameterization trick. In Proceedings of the 29th International Conference on Neural Information Processing Systems—Volume 2, Montréal, QC, Canada, 7–12 December 2015; NIPS’15. pp. 2575–2583. [Google Scholar]
Baldacchino, T.; Cross, E.J.; Worden, K.; Rowson, J. Variational Bayesian mixture of experts models and sensitivity analysis for nonlinear dynamical systems. Mech. Syst. Signal Process. 2016, 66-67, 178–200. [Google Scholar] [CrossRef]
He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; SIGIR’20. pp. 639–648. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NeurIPS’17. pp. 6000–6010. [Google Scholar]
Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; ICML’15. Volume 17, pp. 448–456. [Google Scholar]
Wang, C.; Blei, D.M. Collaborative Topic Modeling for Recommending Scientific Articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 21–24 August 2011; KDD’11. pp. 448–456. [Google Scholar] [CrossRef]
Klema, V.; Laub, A. The singular value decomposition: Its computation and some applications. IEEE Trans. Autom. Control 1980, 25, 164–176. [Google Scholar] [CrossRef]
Harper, F.M.; Konstan, J.A. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 2015, 5, 1–19. [Google Scholar] [CrossRef]
Vig, J.; Sen, S.; Riedl, J. The Tag Genome: Encoding Community Knowledge to Support Novel Interaction. ACM Trans. Interact. Intell. Syst. 2012, 2, 1–44. [Google Scholar] [CrossRef]
Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montréal, QC, Canada, 19–21 June 2009; UAI’09. pp. 452–461. [Google Scholar]
Ricci, F.; Rokach, L.; Shapira, B.; Kantor, P.B. Recommender Systems Handbook, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
Järvelin, K.; Kekäläinen, J. Cumulated Gain-Based Evaluation of IR Techniques. ACM Trans. Inf. Syst. 2002, 20, 422–446. [Google Scholar] [CrossRef]
Lin, S.; Liu, B.; Li, J.; Yang, X. Common Diffusion Noise Schedules and Sample Steps are Flawed. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 5392–5399. [Google Scholar] [CrossRef]

Figure 1. Overview of the proposed two-phase cold item recommender.

Figure 2. Architecture of the

x_{0}

-predictor

{\hat{x}}_{θ} (x_{t}, t, c)

.

Figure 2. Architecture of the

x_{0}

-predictor

{\hat{x}}_{θ} (x_{t}, t, c)

.

Figure 3. Architecture of the Refiner.

Figure 4. Original signal strength vs. timesteps.

Figure 5. Performance with regard to the maximum timestep T.

Figure 6. Performance with regard to the number of experts

| E |

.

Figure 6. Performance with regard to the number of experts

| E |

.

Figure 7. Performance with regard to the content quality.

Figure 8. Distribution of pairwise cosine similarity difference.

Figure 9. T-SNE visualization of item embeddings.

Table 1. Notations.

Notation	Explanation
u, i	A user, an item
U, I, $I_{w}$ , $I_{c}$	The set of users, items, warm items, cold items
$e_{u}$ , $e_{i}$	The ground-truth collaborative embeddings of u, and i
$c_{i}$	Side information embedding of i
$t e_{t}$	Time step embedding of timestep t
$R^{t r}$ , $R^{t e}$	training and test data
$L_{u}^{k}$	Top-k cold item recommendation list for u
$x_{0}$	The original dataset of DM. It indicates $e_{i}$ in this paper
$d_{c o n d}$ , $d_{c f}$	The dimension of side information ( $c_{i}$ ), collaborative embeddings ( $e_{u}, e_{i}$ )
$d_{t i m e}$	The dimension of time step embedding ( $t e_{t}$ )
$r_{u i}$	The interaction indicator between user u and item i
${\hat{r}}_{u i}$	The predicted interaction indicator between u and i
E	The set of experts in Mixture-of-Experts architecture

Table 2. Dataset statistics.

Dataset	Training Set				Evaluation Set			Test Set
Dataset	Users	Items	Clicks	Sparsity	Users	Items	Clicks	Users	Items	Clicks
CiteULike	5551	13,584	164,210	99.78%	5551	1018	13,037	5551	2378	27,739
ML25M	162,466	11,053	11,369,010	99.37%	55,056	829	519,586	161,053	1934	3,541,109

Table 3. Hyperparameters of DMCIR/DMCIR+.

Dataset	Common		DMCIR					DMCIR+
Dataset	$d_{cf}$	$d_{cond}$	$β_{s}$	$β_{e}$	$T$	$\| E \|$	Optimizer (Lr)	Optimizer (Lr)
CiteULike	200	300	0.0001	0.05	400	3	Adam (0.001)	Momentum SGD (0.005)
ML25M	128	400	0.0001	0.05	300	5	Adam (0.001)	Momentum SGD (0.005)

Table 4. Overall recommendation performance.

Dataset	Metric	DONet	Heater	CVAR	CCFC	GME	DMCIR	DMCIR+
CiteULike	P@10	0.099	0.119	0.112	0.119	0.116	0.120	0.127
	P@20	0.076	0.089	0.088	0.089	0.087	0.090	0.095
	P@50	0.050	0.055	0.054	0.054	0.055	0.054	0.058
	R@10	0.213	0.256	0.251	0.273	0.248	0.263	0.278
	R@20	0.325	0.372	0.376	0.389	0.361	0.379	0.401
	R@50	0.506	0.554	0.557	0.557	0.532	0.549	0.581
	N@10	0.223	0.274	0.262	0.276	0.265	0.277	0.291
	N@20	0.270	0.313	0.304	0.313	0.311	0.317	0.334
	N@50	0.327	0.360	0.351	0.347	0.356	0.353	0.376
ML25M	P@10	0.282	0.328	0.281	0.287	0.231	0.309	0.330
	P@20	0.236	0.270	0.237	0.232	0.186	0.255	0.270
	P@50	0.168	0.184	0.168	0.157	0.127	0.176	0.185
	R@10	0.194	0.224	0.192	0.205	0.138	0.210	0.230
	R@20	0.301	0.343	0.301	0.301	0.207	0.332	0.346
	R@50	0.485	0.523	0.489	0.462	0.323	0.503	0.531
	N@10	0.404	0.466	0.399	0.418	0.332	0.436	0.471
	N@20	0.472	0.531	0.465	0.466	0.375	0.503	0.535
	N@50	0.590	0.636	0.588	0.562	0.455	0.618	0.645

Boldface type: the best performance. Underline: the second-best performance.

Table 5. Overall elapse time (in second) for recommendation.

Dataset	Users	Items	Item Gen	Item Refine	User Refine	Rec
CiteULike	5551	16,980	36.840	0.050	0.010	0.048
ML25M	162,540	13,816	20.640	0.029	0.121	0.874

Table 6. Performance of DMCIR with regard to the noise schedule.

Dataset	Metric	$β_{e} = 0.02$	$β_{e} = 0.03$	$β_{e} = 0.04$	$β_{e} = 0.05$	$β_{e} = 0.06$
CiteULike	P@10	0.118	0.120	0.120	0.120	0.120
	P@20	0.089	0.090	0.090	0.090	0.090
	P@50	0.054	0.054	0.054	0.055	0.055
	R@10	0.260	0.263	0.263	0.263	0.262
	R@20	0.373	0.377	0.379	0.379	0.376
	R@50	0.545	0.546	0.549	0.550	0.549
	N@10	0.272	0.275	0.277	0.275	0.277
	N@20	0.312	0.314	0.317	0.317	0.317
	N@50	0.351	0.352	0.353	0.355	0.354
ML25M	P@10	0.307	0.292	0.294	0.299	0.296
	P@20	0.254	0.236	0.238	0.239	0.238
	P@50	0.175	0.174	0.176	0.176	0.176
	R@10	0.209	0.226	0.229	0.231	0.229
	R@20	0.322	0.355	0.357	0.357	0.353
	R@50	0.499	0.497	0.503	0.501	0.499
	N@10	0.435	0.445	0.446	0.451	0.449
	N@20	0.501	0.523	0.526	0.526	0.522
	N@50	0.613	0.612	0.620	0.618	0.616

Boldface type: the best performance. Underline: the second-best performance.

Table 7. Embedding quality to the Refiner’s performance.

Dataset	Metric	${DMCIR}_{0}$	${DMCIR}_{opt}$	${DMCIR}_{0}$ +	${DMCIR}_{opt}$ +
CiteULike	P@10	0.119	0.120	0.125	0.127
	P@20	0.089	0.090	0.095	0.095
	P@50	0.054	0.054	0.058	0.058
	R@10	0.259	0.263	0.274	0.278
	R@20	0.376	0.379	0.398	0.401
	R@50	0.546	0.549	0.581	0.581
	N@10	0.273	0.277	0.288	0.291
	N@20	0.314	0.317	0.334	0.334
	N@50	0.352	0.353	0.375	0.376
ML25M	P@10	0.273	0.309	0.326	0.330
	P@20	0.235	0.255	0.269	0.270
	P@50	0.169	0.176	0.185	0.185
	R@10	0.183	0.210	0.226	0.230
	R@20	0.296	0.332	0.344	0.346
	R@50	0.484	0.503	0.529	0.531
	N@10	0.384	0.436	0.464	0.471
	N@20	0.462	0.503	0.532	0.535
	N@50	0.492	0.618	0.645	0.645

Boldface type: the best performance. Underline: the second-best performance.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Han, J.; Chun, S. Diffusion Model as a Base for Cold Item Recommendation. Appl. Sci. 2025, 15, 4784. https://doi.org/10.3390/app15094784

AMA Style

Han J, Chun S. Diffusion Model as a Base for Cold Item Recommendation. Applied Sciences. 2025; 15(9):4784. https://doi.org/10.3390/app15094784

Chicago/Turabian Style

Han, Jungkyu, and Sejin Chun. 2025. "Diffusion Model as a Base for Cold Item Recommendation" Applied Sciences 15, no. 9: 4784. https://doi.org/10.3390/app15094784

APA Style

Han, J., & Chun, S. (2025). Diffusion Model as a Base for Cold Item Recommendation. Applied Sciences, 15(9), 4784. https://doi.org/10.3390/app15094784

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Diffusion Model as a Base for Cold Item Recommendation

Abstract

1. Introduction

2. Related Work

2.1. Cold Item Recommendation

2.2. Recommendation with Diffusion Models

3. Cold Item Recommender with a Diffusion Model Base

3.1. Cold Item Recommendation

3.2. Preliminary: Diffusion Model

3.2.1. Overview of the Proposed Method

3.2.2. DM-Based Generator

3.3. Second-Phase Refiner

4. Evaluation Environment

4.1. Datasets

4.2. Evaluation Metrics

4.3. Comparison Methods

5. Result Analysis

5.1. Overall Performance (RQ1)

5.2. DM’s Hyperparameter Effect (RQ2)

5.3. Embedding Quality (RQ3)

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI