Denoising Diffusion Model-Driven Adaptive Estimation of Distribution Algorithm Integrating Multi-Modal Data

Bao, Lin; Wang, Lina; Xu, Biao; Yang, Hang; Peng, Yumeng

doi:10.3390/math13233777


by Lin Bao ¹, Lina Wang ¹, Biao Xu ²,*, Hang Yang ¹ and Yumeng Peng ¹

¹ College of Automation, Jiangsu University of Science and Technology, Zhenjiang 212000, China

² College of Engineering, Shantou University, Shantou 515063, China

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(23), 3777; https://doi.org/10.3390/math13233777

Submission received: 18 September 2025 / Revised: 25 October 2025 / Accepted: 7 November 2025 / Published: 25 November 2025

(This article belongs to the Special Issue Artificial Intelligence, Algorithms, and Databases: Innovations and Cross-Disciplinary Impact)


Abstract

Personalized search and recommendation algorithms for multi-modal data have attracted widespread attention. However, existing methods often struggle to integrate multi-source information effectively and to perform global search in complex optimization problems. To address these limitations, this paper proposes a denoising diffusion model-driven adaptive estimation of distribution algorithm integrating multi-modal data. Multi-modal user-generated contents are extensively collected, such as users’ interaction behaviors, category tags, text comments, images, and social network relationships. A user interest preference model based on a denoising diffusion model is established by learning the fusion representation of multi-modal data, which extracts user preference features. A surrogate model based on user preferences and adaptive estimation of distribution strategies are presented in the framework of an estimation of distribution algorithm. A surrogate-driven adaptive estimation of distribution algorithm is designed to align with users’ cognitive experiences and behavioral patterns, thereby enhancing the optimization capability of the personalized search algorithm. Additionally, a dynamic model management mechanism is established to update the user interest preference model with newly available modal information, which tracks the changes in users’ interest preferences in real-world scenarios. It assists users in efficiently filtering items that match their preferences from large-scale information sources. Extensive experiments on public datasets demonstrate the feasibility, effectiveness, and superiority of the proposed algorithm, confirming its improvements in both search efficiency and recommendation performance.

Keywords:

multi-modal data; personalized search; denoising diffusion model; estimation of distribution algorithm; surrogate model

MSC:

68T05; 68T07; 68T20; 68U35; 68W50

1. Introduction

With the rapid development of information technology, the internet, and big data, network information is growing explosively. It is almost impossible to control the generation and dissemination of information in the digital age. The sources, types, and forms of information have become diverse and differentiated. At the same time, the growth in the number of users and their generated data has led to the emergence of new ways of connection, collaboration, and sharing in the network. The scale and complexity of data have reached an unprecedented level, giving rise to the problem of information overload [1,2,3]. In numerous fields such as e-commerce, social media, news, and video platforms, a vast amount of product information is constantly emerging. The information available to users far exceeds their processing capacity, making it difficult for them to find the content that they are interested in or need. Application platforms aim to help users find items of interest within massive amounts of information, thereby enhancing users’ experiences and improving commercial benefits. Therefore, personalized search and recommendation systems have emerged, attracting extensive attention from researchers and practitioners [1,4,5,6,7]. Personalized search and recommendation methods utilize multi-source and multi-modal information that carries users’ preferences, including text, image, audio, and video. They mine feature information and potential correlations to predict users’ potential intentions or interest preferences in the decision-making process for personalized recommendation services. This has become a research hotspot in the field of artificial intelligence [8,9,10,11].

With the development of machine learning and deep learning, many personalized recommendation methods have been proposed, such as collaborative filtering (CF) [12,13], Bayesian personalized ranking (BPR) [14,15], matrix factorization (MF) [16,17], and neural-CF [18], which achieve good recommendation effects. However, explicit feedback imposes higher decision-making costs on users, who need to think carefully before providing quantifiable feedback. Therefore, recommendation systems mine implicit feedback, such as clicking, liking, following, purchasing, and other behaviors, to predict users’ interest preferences for personalized recommendation tasks. Although these recommendation methods have achieved remarkable results, they still face severe challenges of data sparsity, cold start, and dynamic preference modeling. Deep generative models can generate new samples by learning the latent distribution of data even when the data are incomplete or noisy. They possess a powerful capability of feature extraction and data modeling. They have gradually become an important method for handling complex multi-source heterogeneous data, and have been successfully applied in tasks such as image generation and text generation. However, traditional generative models face the challenge of integrating multi-modal data so that feature information can be extracted effectively for practical applications. In recent years, the diffusion model has emerged as a new generative model. It simulates the propagation process of information among graph nodes to capture the similarities and correlations between nodes, achieving feature extraction from multi-source heterogeneous data and information fusion across data modalities. Applying diffusion models to recommendation systems will bring new opportunities to the research fields of personalized search and recommendation.

The needs and interests of different users vary for the same task (such as purchasing books, searching for movies, etc.), and those of the same user at different times may be different in practical application scenarios. In addition, users’ knowledge experiences, potential needs, and behavioral motivations may undergo dynamic changes with the influence of environment, time, and information. When dealing with users’ dynamic personalized demands, the coarse-grained modeling of users’ preferences limits the improvement of personalized recommendation methods. Due to the difficulty in the quantitative representation and dynamic change of users’ preferences, it is necessary to design an appropriate personalized search and recommendation framework. It models users’ interest preferences and provides feedback based on users’ information, which can accurately capture feature information to predict users’ potential preferences and dynamic trends. This further guides the model to adaptively optimize and adjust the dissemination mechanism in the iterative searching process for personalized recommendation tasks. It will improve the search efficiency, recommendation effect, and robustness of personalized search and recommendation algorithms. However, personalized search and recommendation methods still face the enormous challenges of user interaction sparsity, cold start, and long-tail recommendation. It is difficult to accurately capture users’ potential intentions and interest preferences, resulting in poor modeling of dynamic user interest preferences.

To address the aforementioned issues, this paper proposes a denoising diffusion model-driven adaptive estimation of distribution algorithm integrating multi-modal data. User-generated contents are extensively collected to explore multi-modal data that carry users’ interest preferences. Multi-source multi-modal data are effectively integrated by the diffusion model. A user interest preference model based on a denoising diffusion model is established to extract the deep-seated interest preference features of users and the development pattern of user interests. In the framework of the estimation of distribution algorithm (EDA), a surrogate model based on user preferences and adaptive estimation of distribution strategies are designed in multi-modal data fusion mode to simulate users’ cognitive experiences and behavioral patterns and to guide the direction of the evolutionary optimization search. Meanwhile, a dynamic model management mechanism is presented to update the user interest preference model and related models so that users’ interest preferences are tracked in a timely manner. It helps users filter out items that match their interest preferences from a vast amount of information for personalized search and recommendation tasks. The feasibility, effectiveness, and superiority of the proposed algorithm have been verified through a large number of experiments on real multi-domain public datasets. The algorithm enhances the global exploration and local exploitation capabilities of the personalized search algorithm, which improves users’ experience and satisfaction on recommendation system platforms. It also has good scalability and adaptability.

The contributions of this paper mainly include three aspects. (1) For dynamic personalized search and recommendation tasks, a user interest preference model based on the denoising diffusion model is constructed by considering multi-modal information fusion and cross-modal alignment representation. It interprets multi-modal information that carries users’ interests to obtain the preference features of the users. (2) Adaptive estimation of distribution strategies based on the user interest preference model are designed in the framework of the estimation of distribution algorithm. They refine users’ intention representation and interest tendency from a micro perspective to generate new individuals that carry the user preference and fit the dynamic change in users’ interest preferences. (3) A denoising diffusion model-driven user preference surrogate model is established to estimate the fitness of individuals and track users’ interest preferences, guiding the direction of the personalized evolutionary search. It helps complete personalized search and recommendation tasks efficiently.

The remainder of this study is organized as follows. Section 2 introduces the notations of our study and related work. In Section 3, the proposed algorithm is described in detail. Section 4 presents comparative experiment results and corresponding analysis. Finally, the conclusion is presented.

2. Related Work

2.1. Mathematical Description for Personalized Search Problems with User-Generated Contents

Personalized search with user-generated contents (UGCs) involves retrieving optimization targets that align with users’ potential needs and personalized interest preferences from a dynamically evolving search space of massive multi-source heterogeneous data. This process ultimately generates personalized item lists (comprising products or solutions) tailored to each user. At its core, this task represents a complex and dynamic optimization problem with qualitative objectives. In the process of personalized search, users evaluate and make decisions regarding retrieved items based on their own cognitive experiences and interest preferences. However, users’ cognitive experience and interest preferences are often diverse, ambiguous, uncertain, and continually evolving. As a result, the definition of users’ satisfactory solutions is highly subjective and varies significantly among individuals. Consequently, both the search outcomes and recommendation effectiveness are ultimately determined by users’ subjective judgments. Here, the objective function $f(X_{u,v})$ for the personalized search problem with UGCs can be defined as follows:

$$\begin{cases} Y = f(X_{u,v}; \theta_{f}) \\ \text{s.t.}\ u \in U,\ v \in V \end{cases} \tag{1}$$

where $U = \{u_{1}, u_{2}, \dots, u_{|U|}\}$ is the user set and $V = \{v_{1}, v_{2}, \dots, v_{|V|}\}$ is the set of items (the feasible solution space); the feasible solution space $V$ is usually large and sparse. The preference of the current user $u$ on item $v$ is expressed as a model function $f(X_{u,v})$ with learnable parameters $\theta_{f}$.

A personalized search provides users with a list of recommended items $TopN$ from the feasible solution space, which comprises $N$ items of higher value $f(X_{u,v})$ that are likely to align with their interests. Through the presentation of these relevant items, the system completes the search and recommendation tasks, thereby stimulating user exploration and improving overall experience and satisfaction.
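As a minimal sketch of how such a $TopN$ list can be formed, the following Python fragment ranks items by their predicted preference values and keeps the $N$ best. The function name `top_n`, the item identifiers, and the score values are hypothetical and serve only as an illustration; the scores stand in for $f(X_{u,v})$ computed by some preference model.

```python
import heapq

def top_n(scores, n):
    """Return the n item ids with the highest predicted preference."""
    # `scores` maps an item id to its predicted preference f(X_{u,v}).
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical predicted preferences of one user for five items.
scores = {"v1": 3.2, "v2": 4.7, "v3": 1.9, "v4": 4.1, "v5": 2.5}
recommended = top_n(scores, 3)
print(recommended)
```

Using a heap-based selection avoids sorting the whole item set, which matters when the feasible solution space $V$ is large.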

2.2. Recommendation Algorithms Integrating Multi-Modal Data

Various extractable and analyzable multi-modal data, including textual, visual, and auditory information, are frequently utilized in recommendation systems as important supplementary information to users’ interaction behaviors. These data enrich the representation features of users and items, thereby partially alleviating data sparsity and the cold-start problem. The integration of these multi-modal data with the collaborative information is critical. It enables the effective fusion of single-modal representations and multi-modal features, thus maintaining the integrity and diversity of the combined information. These comprehensive and accurate extracted features of user preferences and item representations are fundamental to recommendation algorithms that integrate multi-modal data.

In early research, He et al. [19] extracted items’ image features through convolutional neural networks (CNN) and used matrix factorization to predict users’ preferences. The visual Bayesian personalized ranking (VBPR) model was proposed to use Bayesian personalized ranking to alleviate the cold start problem. Kim et al. [20] combined CNN and probabilistic matrix factorization (PMF) to capture the context information of documents to present a convolutional matrix factorization model (ConvMF) for improving the recommendation accuracy. Chen et al. [21] utilized the ResNet-152 model to extract modal information in images and video to propose an attentive collaborative filtering (ACF) method. Wei et al. [22] utilized multi-modal information of vision, audio, and text to construct a user-short video bipartite graph. A multi-modal graph convolution network (MMGCN) is proposed by the topological structure of neighboring nodes to enrich the representation of each node. It can learn high-order features from the user-short video bipartite graph to improve the recommendation performance. Wang et al. [23] used a pre-trained word embedding model to represent text features and utilized convolutional neural networks to obtain different single-level visual features from different pooling layers. A movie recommendation system based on visual recurrent convolutional matrix factorization (VRConvMF) is proposed to improve the accuracy of the recommendation system. Yang et al. [24] designed the multi-modal module, attention module, and multi-head residual network module to extract the image features of the video cover to expand the learned feature set. The multi-head multi-modal deep interest network (MMDIN) is proposed to enhance the representational ability, predictive performance, and recommendation performance. Deng et al. [25] proposed a recommendation model based on multi-modal fusion and behavior expansion. 
A learning-query multi-modal fusion module is designed to perceive the dynamic content of flow fragments and handle complex multi-modal interactions. A graph-guided interest expansion method is presented to learn the representations of users and information flows in a multi-modal attribute large-scale graph. Yan et al. [26] presented a pre-trained model that can extract high-quality multi-modal embedding representations. A content interest-aware supervised fine-tuning is designed to guide the alignment of user preference embedding representations through users’ behavior signals to bridge the semantic gap between contents and user interests. A multi-modal content interest modeling paradigm for user behavior modeling is proposed by integrating multi-modal embedding and ID-based collaborative filtering into a unified framework.

The above-mentioned methods take into account multi-modal information, such as images, videos, text, contexts, and sounds, to conduct the feature extraction and information fusion of multi-modal data from multiple aspects. This has improved the performance of the recommendation system to a certain extent. However, the existing methods do not fully consider the correlations, differences, and dynamics among different modalities, resulting in incomplete multi-modal information mining and insufficient cross-modal alignment representation. It will affect the ability to model users’ interest preferences and lead to an insufficient understanding of users’ potential needs and deep interest preferences. Meanwhile, users’ preference information is different in various types of data, and its contribution to establishing a user interest preference model varies when conducting multi-modal information fusion. If an appropriate information fusion method is not used, a large amount of noise will be introduced, and there will be deficiencies in feature fusion and semantic association modeling. It will affect the accuracy and effectiveness of the recommendation system.

2.3. Diffusion Models

In recent years, diffusion models have achieved significant breakthroughs in domains including computer vision and natural language processing, driven by their paramount capabilities in data generation, representation learning, and sequence modeling. Consequently, they have risen as an emerging research hotspot [27,28,29,30,31]. Technically, a diffusion model is a deep generation method whose core concept comprises a forward diffusion process of progressively adding Gaussian noise to the data and a reverse process that learns to reconstruct the data through iterative denoising. This methodology has been introduced into recommender systems, providing a novel generative modeling paradigm that can effectively alleviate data sparsity and support the recommendation integrating multi-modal information [32,33,34,35].

Li et al. [36] proposed a sequential recommendation method based on a diffusion model. Items are represented as distributions that adaptively reflect users’ multiple interests and the multiple aspects of items. The target item is embedded into a Gaussian distribution by adding noise, which is applied to the distribution representation generation and uncertainty injection of the sequential item. Based on users’ historical interactions, Gaussian noise is reverse-transformed into the representation of target items. Zhao et al. [37] presented a denoising diffusion recommendation model. It utilizes the multi-step denoising process of the diffusion model to inject controllable Gaussian noise in the forward process and iteratively remove noise in the reverse denoising process, making the embedding representations of users and items more robust. Jiang et al. [38] proposed a knowledge graph diffusion model for recommendation. By integrating generative diffusion models with data augmentation paradigms, it achieves robust knowledge graph representation learning and promotes the collaboration between knowledge-aware item semantics and collaborative relationship modeling. A collaborative knowledge graph convolution mechanism with collaborative signals reflecting user-item interaction patterns was introduced to guide the knowledge graph diffusion process. Cui et al. [39] utilized context information to generate reasonable augmented views and proposed an enhanced sequential recommendation method with context-aware diffusion-based contrastive learning. Xia et al. [40] proposed an anisotropic diffusion model for collaborative filtering in the spectral domain. It maps the user interaction vector to the spectral domain and parameterizes the diffusion noise to align with graph frequency. This anisotropic diffusion retains significant low-frequency components to maintain a high signal-to-noise ratio. 
Further, the conditional denoising network is adopted to encode users’ interactions to restore the true preferences from noise data. However, diffusion models are confronted with challenges such as high computational demands and a lack of interpretability. Consequently, the future research trajectory for diffusion-based recommendation algorithms, through continued model optimization and cross-domain integration, will concentrate on avenues including efficient reasoning and causal modeling. This focus is instrumental in facilitating the transition of these models from research to practical application in real-world business scenarios.

3. Denoising Diffusion Model-Driven Adaptive Estimation of Distribution Algorithm Integrating Multi-Modal Data

3.1. The Proposed Algorithm Framework

The framework of the proposed denoising diffusion model-driven adaptive estimation of distribution algorithm integrating multi-modal data (DDM-AEDA) is shown in Figure 1.

The proposed algorithm mainly consists of four parts: (1) multi-modal data processing, including data acquisition and vectorized representation; (2) a user interest preference model based on the denoising diffusion model; (3) a surrogate model-driven adaptive estimation of distribution algorithm; (4) a model management mechanism for dynamically tracking the evolution of users’ interest preferences.

3.2. Multi-Modal Data Processing and Its Fusion Learning Representation

A substantial volume of multi-modal user-generated content is widely collected in network environments. These data encompass historical user interactions (such as ratings and textual comments), item content information (including category tags, descriptions, and images), and social network relationships. These diverse sources contain numerous explicit and implicit user preferences. The proposed method fully utilizes the above-mentioned knowledge, which reflects user interests from different perspectives, to alleviate data sparsity in big data environments, thereby achieving a comprehensive performance improvement for personalized search and recommendation algorithms.

This section details the preprocessing and feature representation techniques applied to multi-modal user-generated contents.

(1): Users’ ratings: Users’ ratings on items are represented as a user rating matrix $R = {[r_{i j}]}_{|U| \times |V|}$ , where $r_{i j}$ represents the rating of user $u_{i}$ for item $v_{j}$ . The larger the value, the more the user $u_{i}$ likes the item $v_{j}$ . Users’ ratings explicitly express the degree of users’ preferences for items.
(2): Items’ category tags: Items’ category tags briefly describe specific contents or feature information. To a certain extent, they reflect users’ interest preferences. Here, multi-hot encoding is adopted. Based on the discrete values of the limited category tags, the category tags of the item individual $v$ are vectorized as $c = [c_{1}, c_{2}, \dots, c_{i}, \dots, c_{n_{1}}]$ , where $c_{i}$ is the category tag of item $v$ and $n_{1}$ is the total number of category tags of all items. If $c_{i} = 1$ , it indicates that the item $v$ contains the $i$ th category tag; otherwise, it indicates that the item $v$ does not contain the $i$ th category tag.
(3): Text comments: Text comments contain a large amount of users’ implicit preference information. Users express their latent needs and interest preferences through emotional tendencies and semantic information in text comments. Users’ text comments are collected to carry out natural language preprocessing and text vectorization representation. An unsupervised Doc2Vec model was trained on a corpus constructed from the dataset, resulting in a feature representation model that encodes the latent semantic information of users’ textual comments.

By considering the interrelationships among context, word order, and semantics, the Doc2Vec model distills high-dimensional sparse word vectors into low-dimensional dense feature vectors. It thereby learns fixed-length feature representations from variable-length text content, such as sentences, paragraphs, or documents. A text comment vectorized representation matrix indexed by item ID is generated, denoted as $T = [t_{1}, t_{2}, \dots, t_{i}, \dots, t_{|V|}]^{\top}$, $T \in \mathbb{R}^{|V| \times n_{2}}$, where $t_{i} = [t_{i1}, t_{i2}, \dots, t_{in_{2}}]$ represents the text comment vectorized representation of the item $v_{i}$ and $n_{2}$ is its length.

(4): Social network relationships: Social network relationships express the friendship or similarity between users. Usually, neighboring users have similar interests or hobbies, so these social network relationships imply a large amount of users’ preference information. Here, the Pearson correlation coefficient $Sim(u_{i}, u_{j})$ is adopted to measure the similarity between users. The personalized recommendation algorithm leverages data from neighboring users to help infer items or content that a given user might prefer.
(5): Image information: A pre-trained ResNet model [41] is utilized to extract high-dimensional visual feature vectors $g_{i} \in \mathbb{R}^{n_{3}}$ of items’ images. The vectorized representation of items’ images is expressed as $G = [g_{1}, g_{2}, \dots, g_{|V|}]^{\top} \in \mathbb{R}^{|V| \times n_{3}}$ , where $n_{3}$ is the length of the vectorized representation of each image.
(6): Multi-modal information fusion and cross-modal alignment representation: Through the fully connected layer, the vector representations of items’ tags, comments, and images are mapped to a shared low-dimensional space to obtain the embedded representations $c_{emb}$ , $t_{emb}$ , and $g_{emb}$ :

$$c_{emb} = W_{c} c + b_{c} \in \mathbb{R}^{n_{1}} \tag{2}$$

$$t_{emb} = W_{t} t + b_{t} \in \mathbb{R}^{n_{2}} \tag{3}$$

$$g_{emb} = W_{g} g + b_{g} \in \mathbb{R}^{n_{3}} \tag{4}$$

where $W_{c}$, $W_{t}$, and $W_{g}$ are, respectively, the weights of the embedded representations of item tags, comments, and images; $b_{c}$, $b_{t}$, and $b_{g}$ are, respectively, the corresponding biases.

These embedded representations are concatenated into a consistent learning representation $x$:

$$x = \mathrm{Concat}(c_{emb}, t_{emb}, g_{emb}) \in \mathbb{R}^{n_{1} + n_{2} + n_{3}} \tag{5}$$

The above process achieves the fusion representation of multi-modal data information to obtain the genotype representation $x$ of the item individual $v$.
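The fusion steps of Equations (2)–(5) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the tag vocabulary, the dimensions ($n_{1} = 4$, $n_{2} = 6$, $n_{3} = 8$), and the randomly initialized weights are assumptions, and the random vectors `t` and `g` merely stand in for the Doc2Vec and ResNet features, which are not computed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_hot(tags, vocabulary):
    """Multi-hot vector c: c[i] = 1 if the item carries the i-th tag."""
    return np.array([1.0 if tag in tags else 0.0 for tag in vocabulary])

def linear_embed(v, W, b):
    """Fully connected layer without activation, as in Eqs. (2)-(4)."""
    return W @ v + b

# Hypothetical tag vocabulary (n1 = 4) and one item's tags.
vocabulary = ["action", "comedy", "drama", "sci-fi"]
c = multi_hot({"action", "sci-fi"}, vocabulary)
t = rng.normal(size=6)  # stand-in for a Doc2Vec comment vector (n2 = 6)
g = rng.normal(size=8)  # stand-in for a ResNet image vector (n3 = 8)

# Randomly initialized weights and zero biases; in practice they are learned.
W_c, b_c = rng.normal(size=(4, 4)), np.zeros(4)
W_t, b_t = rng.normal(size=(6, 6)), np.zeros(6)
W_g, b_g = rng.normal(size=(8, 8)), np.zeros(8)

# Eq. (5): concatenate the three embeddings into the genotype x.
x = np.concatenate([
    linear_embed(c, W_c, b_c),
    linear_embed(t, W_t, b_t),
    linear_embed(g, W_g, b_g),
])
print(x.shape)
```

The weight matrices are square here because Equations (2)–(4) keep each modality at its original dimension, so $x$ has length $n_{1} + n_{2} + n_{3}$.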

3.3. User Interest Preference Model Based on Denoising Diffusion Model

The collection of items that users like is screened to establish a dominant population $P_{a}$ containing users’ positive preference information. A training dataset $T_{a} = \{x_{i}\ (i = 1, 2, \dots, |P_{a}|)\}$ is formed by combining the genotype representation of individual items. A user interest preference model based on the denoising diffusion model is constructed and is shown in Figure 2.

The training dataset is fed into the user interest preference model based on the denoising diffusion model. The training process mainly comprises two key stages: a forward process and a reverse process. In the forward process, the original data are progressively corrupted through the gradual addition of Gaussian noise. Let $x_{0}$ represent the original user behaviors. The forward process simulates a Markov chain to obtain a series of intermediate states $x_{1}, x_{2}, \dots, x_{K}$, where $K$ is the total number of diffusion steps, and the distribution of each step is a conditional Gaussian distribution. The forward process is designed to operationalize the concept of data corruption. The addition of Gaussian noise directly models the effects of implicit feedback noise, such as accidental clicks or non-purposeful browsing. Given the current state $x_{k}$, the next state $x_{k + 1}$ is obtained:

$$x_{k + 1} = \sqrt{1 - \beta_{k}}\, x_{k} + \sqrt{\beta_{k}}\, \varepsilon_{k} \tag{6}$$

where $\beta_{k}$ is the noise variance, controlling the noise intensity at each step; $\varepsilon_{k} \sim \mathcal{N}(0, I)$ is standard Gaussian noise.
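For illustration, the forward recursion of Equation (6) can be simulated directly. The sketch below is not the authors’ implementation: the linear noise schedule, the number of steps, and the function name `forward_diffusion` are assumptions chosen only to make the example concrete.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Apply Eq. (6) once per noise level and return all states x_0, x_1, ..."""
    states = [np.asarray(x0, dtype=float)]
    for beta in betas:
        eps = rng.standard_normal(states[-1].shape)  # eps_k ~ N(0, I)
        states.append(np.sqrt(1.0 - beta) * states[-1] + np.sqrt(beta) * eps)
    return states

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
x0 = np.ones(16)                       # toy genotype representation
states = forward_diffusion(x0, betas, rng)
print(len(states), states[-1][:3])
```

Each step shrinks the signal by $\sqrt{1 - \beta_{k}}$ and adds noise of variance $\beta_{k}$, so after many steps the state is close to a sample from $\mathcal{N}(0, I)$.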

The reverse process is also modeled as a Markov process, starting from the noise sample $x_{K} \sim \mathcal{N}(0, I)$ and recovering $\hat{x}_{0}$ by gradually removing the noise. The denoising distribution at each step is:

$$p_{\theta}(\hat{x}_{k - 1} \mid \hat{x}_{k}) = \mathcal{N}(\hat{x}_{k - 1}; \mu_{\theta}(\hat{x}_{k}, k), \Sigma_{\theta}(\hat{x}_{k}, k)) \tag{7}$$

where the mean $\mu_{\theta}(\hat{x}_{k}, k)$ and the covariance $\Sigma_{\theta}(\hat{x}_{k}, k)$ are the parameters learned through the diffusion model.

The reverse process $f_{DDPM}$ aims to reverse the noise addition in the forward process as much as possible to recover the original data $\hat{x}_{0}$:

$$\hat{x}_{0} = f_{DDPM}(x_{0}) \tag{8}$$

The true distribution $q(\hat{x}_{k - 1} \mid \hat{x}_{k}, x_{0})$ is derived from the forward process, representing the transition from $\hat{x}_{k}$ to $\hat{x}_{k - 1}$. Given $x_{k}$ and $x_{0}$, the true distribution $q(\hat{x}_{k - 1} \mid \hat{x}_{k}, x_{0})$ is a Gaussian distribution:

$$q(\hat{x}_{k - 1} \mid \hat{x}_{k}, x_{0}) = \mathcal{N}(\hat{x}_{k - 1}; \mu_{k}(\hat{x}_{k}, x_{0}), \beta_{k} I) \tag{9}$$

The negative log-likelihood of the data is minimized by adjusting the parameters $\theta$ of the reverse process. During the training process, the variational bound $L(\theta)$ is used for optimization:

$$L(\theta) = E_{q}\left[\sum_{k = 1}^{K} D_{KL}\left(q(\hat{x}_{k - 1} \mid \hat{x}_{k}, x_{0}) \,\|\, p_{\theta}(\hat{x}_{k - 1} \mid \hat{x}_{k})\right)\right] \tag{10}$$

where $D_{KL}$ is the Kullback–Leibler divergence, which is used to measure the difference between the true distribution $q(\hat{x}_{k - 1} \mid \hat{x}_{k}, x_{0})$ and the generated distribution $p_{\theta}(\hat{x}_{k - 1} \mid \hat{x}_{k})$.

Through this procedure, the user interest preference model based on the denoising diffusion model learns to recover the underlying data distribution by denoising corrupted inputs, progressively refining its output toward the true distribution of users’ preferences during optimization.
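Because both distributions inside each term of Equation (10) are Gaussian, every Kullback–Leibler term has a closed form. The sketch below evaluates it for Gaussians with diagonal covariance; the diagonal restriction and the function name `kl_diag_gaussians` are assumptions made for illustration, not details stated in the paper.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in closed form."""
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# With equal variances beta, the term reduces to ||mu_q - mu_p||^2 / (2 * beta).
beta = 0.5
kl = kl_diag_gaussians(np.zeros(2), np.full(2, beta), np.ones(2), np.full(2, beta))
print(kl)
```

When the two variances coincide, only the squared distance between the means remains, which is why training a denoising model with fixed variance amounts to matching means.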

3.4. Surrogate Model-Driven Adaptive Estimation of Distribution Algorithm

This approach employs matrix factorization to derive latent representations of users and items, which are used to build a surrogate model for evaluating the fitness of individuals. It then incorporates an Estimation of Distribution Algorithm (EDA) to develop a probabilistic sampling model with an adaptive strategy. This mechanism boosts data utility, thereby guiding the personalized evolutionary search process for the personalized recommendation algorithm.

The learning representation

x

,

x \in P_{a}

of the item individual is fed into the trained user interest preference model based on the denoising diffusion model to obtain the implicit representation

u_{p}

of users.

u_{p} = f_{D D P M} (x)

(11)

According to the matrix factorization model, the predicted preference value

f (X_{u, v})

of the current user on items is estimated by using the implicit representation

u_{p}

of the current user

u

and the learning representation of the item individual

v

:

f (X_{u, v}) = {\overset{\land}{R}}_{D D P M} = u_{p} \cdot x + b_{u} + b_{x} + μ_{a l l}

(12)

where

b_{u}

and

b_{x}

are, respectively, the bias of users and items;

μ_{a l l}

is the average rating of all samples.
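As a minimal sketch of Equation (12), the function below computes the biased matrix-factorization prediction from a user representation and an item representation. The function name and the example values are hypothetical; in the algorithm the user vector would come from the trained preference model of Equation (11).

```python
import numpy as np

def surrogate_fitness(u_p, x, b_u, b_x, mu_all):
    """Predicted preference of Equation (12): u_p . x + b_u + b_x + mu_all."""
    u_p = np.asarray(u_p, dtype=float)
    x = np.asarray(x, dtype=float)
    if u_p.shape != x.shape:
        raise ValueError("user and item representations must have the same shape")
    return float(u_p @ x) + b_u + b_x + mu_all

# Hypothetical values, for illustration only.
score = surrogate_fitness([1.0, 2.0], [3.0, 4.0], b_u=0.1, b_x=0.2, mu_all=3.0)
```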

The surrogate model

f (X_{u, v})

serves as the fitness function that guides the personalized evolutionary search in the estimation of distribution algorithm. The dominant population

P_{a}

is taken as the initial population

P (t = 0)

. The denoising diffusion probability model

P_{D D P M} (x_{i})

is obtained through the user interest preference model based on the denoising diffusion model.

P_{D D P M} (x_{i}) = \tanh (\sum_{j} W_{i j} h_{j} + b_{i})

(13)

where

W_{i j}

and

b_{i}

are, respectively, the weight and bias of the denoising diffusion probability model;

h_{j}

is a hidden-layer unit of that model.
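Equation (13) is a single affine map followed by a hyperbolic tangent. A small sketch, assuming the hidden activations are available as a vector (the function name and shapes are illustrative):

```python
import numpy as np

def ddpm_output_layer(W, h, b):
    """Equation (13): component i is tanh(sum_j W[i, j] * h[j] + b[i])."""
    W = np.asarray(W, dtype=float)
    h = np.asarray(h, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.tanh(W @ h + b)
```

Because of the tanh, every output component lies strictly between -1 and 1.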

The population is fed into the denoising diffusion probability model to obtain the reconstructed representation of individual items. Then, the sampling probability model

G (x)

in EDA is established by the reconstructed representation of individual items:

G (x) = \frac{{(2 π)}^{- n / 2}}{{(\det C)}^{1 / 2}} \exp (- (x - \bar{μ})^{T} {(C)}^{- 1} (x - \bar{μ}) / 2)

(14)

where

\bar{μ}

and

C

are, respectively, the mean vector and covariance matrix of the reconstructed representations of individual items.
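The sampling model is a multivariate Gaussian, so it can be fitted and sampled with a few lines of NumPy. The sketch below is illustrative: the helper names are hypothetical, the covariance uses a 1/N normalization, and the density assumes a positive-definite covariance matrix (in practice a small diagonal term may be needed when it is near-singular).

```python
import numpy as np

def fit_gaussian(reconstructed):
    """Mean vector and covariance matrix of row-wise item representations."""
    X = np.asarray(reconstructed, dtype=float)
    mu = X.mean(axis=0)
    centered = X - mu
    C = centered.T @ centered / X.shape[0]
    return mu, C

def gaussian_density(x, mu, C):
    """Standard multivariate Gaussian density evaluated at x."""
    mu = np.asarray(mu, dtype=float)
    diff = np.asarray(x, dtype=float) - mu
    sign, logdet = np.linalg.slogdet(C)
    if sign <= 0:
        raise ValueError("C must be positive definite")
    quad = diff @ np.linalg.solve(C, diff)
    n = mu.shape[0]
    return float(np.exp(-0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)))

def sample_offspring(mu, C, size, seed=0):
    """Draw new individuals from N(mu, C)."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, C, size=size)
```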

According to the elite selection strategy and the fitness function, the evolutionary individuals with higher fitness are selected to generate a subpopulation

P_{s r} (t) = s r * P (t)

, where

s r

is the selection ratio. The mean

\bar{μ} (t)

is calculated by the maximum likelihood estimation of

P_{s r} (t)

.

\bar{μ} (t) = \frac{1}{|P_{s r} (t)|} \sum_{i = 1}^{|P_{s r} (t)|} P_{s r} {(t)}_{i}

(15)
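A minimal sketch of the elite selection and the mean of Equation (15) follows. It assumes that a larger surrogate value means a fitter individual and that the subpopulation size is obtained by truncating sr times the population size (at least one individual); both choices, and the function name, are assumptions for illustration.

```python
import numpy as np

def elite_mean(population, fitness, sr):
    """Select the top sr fraction by fitness and return (mean, selected indices)."""
    P = np.asarray(population, dtype=float)
    f = np.asarray(fitness, dtype=float)
    n_sel = max(1, int(sr * len(P)))           # assumed rounding rule
    elite_idx = np.argsort(-f, kind="stable")[:n_sel]
    return P[elite_idx].mean(axis=0), elite_idx
```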

The mean

\bar{μ} (t)

mainly controls the center of sampling offspring. An excessively high selection ratio can displace the population mean too far from the optimal fitness region. While this enhances diversity, it slows down convergence in later stages. Conversely, an overly small ratio pulls the mean closer to the optimum, which favors local exploitation but risks premature convergence. Here, an adaptive strategy of

s r

is designed as follows:

s r = s r_{m a x} - (s r_{m a x} - s r_{m i n}) {(\frac{F E s}{F E s_{m a x}})}^{0.1}

(16)

where

s r_{m a x}

and

s r_{m i n}

represent the maximum and minimum selection ratios;

F E s_{m a x}

indicates the maximum number of fitness function evaluations;

F E s

denotes the number of fitness function evaluations up to the current generation.
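The schedule of Equation (16) can be written directly as a function of the evaluation budget consumed. The function name is hypothetical, and the bounds passed in the usage line are placeholder values rather than settings reported in the paper.

```python
def adaptive_selection_ratio(fes, fes_max, sr_max, sr_min):
    """Equation (16): sr = sr_max - (sr_max - sr_min) * (FEs / FEs_max) ** 0.1."""
    if fes_max <= 0 or not 0 <= fes <= fes_max:
        raise ValueError("require 0 <= fes <= fes_max and fes_max > 0")
    progress = fes / fes_max
    return sr_max - (sr_max - sr_min) * progress ** 0.1

# Placeholder bounds, for illustration only.
sr_start = adaptive_selection_ratio(0, 1000, sr_max=0.4, sr_min=0.1)
sr_end = adaptive_selection_ratio(1000, 1000, sr_max=0.4, sr_min=0.1)
```

The ratio starts at the maximum, ends at the minimum, and, because the exponent 0.1 is well below 1, most of the decrease happens early in the run.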

According to

P_{s c} (t) = c s * P (t)

, a subpopulation

P_{s c} (t)

is formed, where

c s

is the covariance scaling parameter,

c s > s r

. An archive set

A (t) = P_{s c} (t - 1) \cup P_{s c} (t - 2) \cup \dots \cup P_{s c} (t - l)

is obtained by storing each generation’s subpopulation, where

l

is the length of the archive, i.e., the number of past generations retained.

H (t) = A (t) \cup P_{s c} (t)

is obtained by merging the archive with the current subpopulation. The covariance

C (t)

is estimated to select more individuals with higher fitness.

C (t) = \frac{1}{|H (t)|} \sum_{i = 1}^{|H (t)|} (H {(t)}_{i} - \bar{μ} (t)) {(H {(t)}_{i} - \bar{μ} (t))}^{T}

(17)
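The archive-based estimate of Equation (17) can be sketched with a bounded queue that keeps the last l subpopulations. In this illustration the set union is treated as simple row concatenation (duplicates are kept), which is an assumption; the function name is hypothetical.

```python
import numpy as np
from collections import deque

def archive_covariance(archive, current_subpop, mean):
    """Equation (17): covariance of H(t) = A(t) U P_sc(t) around the given mean."""
    H = np.vstack([np.asarray(a, dtype=float) for a in archive]
                  + [np.asarray(current_subpop, dtype=float)])
    centered = H - np.asarray(mean, dtype=float)
    return centered.T @ centered / H.shape[0]

# The archive keeps only the l most recent subpopulations (here l = 2).
archive = deque(maxlen=2)
archive.append(np.array([[1.0, 0.0]]))
C = archive_covariance(archive, np.array([[-1.0, 0.0]]), mean=np.zeros(2))
```

Using `deque(maxlen=l)` means the oldest generation is dropped automatically when a new subpopulation is appended.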

While an overly large

c s

facilitates a wide sampling range—beneficial for early-stage diversity—it hinders focused search during later stages. Conversely, an overly small

c s

restricts the sampling range, which is suitable for late-stage local exploitation but risks premature convergence. Therefore, the

c s

value is dynamically adjusted throughout the evolution of the estimation of distribution algorithm. Here, an adaptive strategy of

c s

is calculated as follows:

c s = 1 - (1 - s r_{\min}) {(\frac{F E s}{F E s_{m a x}})}^{2}

(18)

The proposed adaptive Estimation of Distribution Algorithm dynamically adjusts the selection ratio and covariance scaling parameter by leveraging the historical information from the archive set

A (t)

. This framework maintains strong global exploration capabilities during early evolution and later precisely converges to optimal regions aligning with users’ interest preferences, thereby achieving a dynamic balance in the personalized search process.
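Equation (18) can be sketched in the same way as the selection-ratio schedule. The function name is hypothetical and the value of the lower bound in the usage lines is a placeholder.

```python
def adaptive_covariance_scaling(fes, fes_max, sr_min):
    """Equation (18): cs = 1 - (1 - sr_min) * (FEs / FEs_max) ** 2."""
    if fes_max <= 0 or not 0 <= fes <= fes_max:
        raise ValueError("require 0 <= fes <= fes_max and fes_max > 0")
    progress = fes / fes_max
    return 1.0 - (1.0 - sr_min) * progress ** 2

# Placeholder lower bound, for illustration only.
cs_start = adaptive_covariance_scaling(0, 1000, sr_min=0.1)
cs_end = adaptive_covariance_scaling(1000, 1000, sr_min=0.1)
```

With the exponent 2, cs stays close to 1 early on and falls mainly in the late stage. Comparing Equations (16) and (18) term by term for any maximum selection ratio not exceeding 1, cs remains at or above sr throughout the run, with the two meeting at the lower bound when the evaluation budget is exhausted.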

New individuals are generated by the sampling probabilistic model in EDA. Based on the Pearson similarity criterion, the similarity between these new individuals and real items in the feasible solution space is computed. The most similar items are then selected to replace the evolutionary individuals, forming a candidate recommendation set

s_{u}

. The surrogate model estimates the fitness of each candidate. Finally, through an elite selection strategy, the Top-N items are recommended to the user, completing one interactive recommendation cycle. The population is subsequently updated using the user’s feedback to initiate the next round of the interactive personalized evolutionary search.
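The mapping from sampled individuals to real items can be sketched as a nearest-neighbor lookup under Pearson correlation. The function below is an illustrative, brute-force version over all items; the name, the top_k parameter, and the handling of constant vectors are assumptions.

```python
import numpy as np

def pearson_nearest_items(generated, items, top_k=1):
    """For each generated vector, return indices of the top_k most correlated items."""
    G = np.asarray(generated, dtype=float)
    V = np.asarray(items, dtype=float)
    Gc = G - G.mean(axis=1, keepdims=True)
    Vc = V - V.mean(axis=1, keepdims=True)
    Gn = np.linalg.norm(Gc, axis=1, keepdims=True)
    Vn = np.linalg.norm(Vc, axis=1, keepdims=True)
    # Constant vectors have zero variance; avoid dividing by zero (correlation 0).
    Gn[Gn == 0] = 1.0
    Vn[Vn == 0] = 1.0
    corr = (Gc / Gn) @ (Vc / Vn).T            # shape: (n_generated, n_items)
    order = np.argsort(-corr, axis=1)         # most similar first
    return order[:, :top_k]
```

The returned indices would form the candidate set that the surrogate model then scores.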

3.5. Model Dynamic Management Mechanism

To address the diversity and time-varying nature of user interests in complex networks, we design a dynamic model management mechanism. This closed-loop feedback system enables continuous model evolution and strategy optimization. When environmental shifts cause model prediction accuracy to fall below a preset threshold, it triggers a collaborative update of the user preference model and its related models. These models and parameters are then dynamically refined using new user-generated content, thereby promptly tracking user interests and monitoring accuracy. This process guides the personalized search toward satisfactory solutions, ultimately completing the recommendation task.
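The trigger described above can be sketched as a small monitor that averages recent accuracy readings and signals when retraining is due. The class name, the sliding-window length, and the choice of accuracy measure are assumptions; only the idea of comparing an average accuracy against a preset threshold comes from the text.

```python
from collections import deque

class AccuracyMonitor:
    """Signals a model update when mean recent accuracy drops below a threshold."""

    def __init__(self, threshold, window=5):
        self.threshold = threshold
        self.history = deque(maxlen=window)   # keeps only the latest readings

    def record(self, accuracy):
        self.history.append(accuracy)

    def needs_update(self):
        if not self.history:
            return False                      # nothing observed yet
        return sum(self.history) / len(self.history) < self.threshold
```

In an interactive loop, `record` would be called after each round of user feedback, and a `True` result from `needs_update` would route control back to retraining the preference model.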

3.6. Algorithm Implementation and Computational Complexity Analysis

The implementation steps of the proposed algorithm are given in pseudocode below (Algorithm 1):

Algorithm 1: DDM-AEDA

Input: Multi-modal UGCs
Output: Top-N Item recommendation list
Start

1. Multi-modal data preprocessing: Multi-modal user-generated contents are processed as described in Section 3.2;
2. Pre-trained models: User-generated contents are collected to train pre-trained doc2vec and ResNet models for representation learning;
3. Initialization: In the search space, an initial dominant group is formed by filtering a set of items aligning with user preferences derived from user-generated data;
Do while (The algorithm termination condition has not been met)
4. User interest preference model: According to the method in Section 3.3, a user interest preference model based on a denoising diffusion model is constructed and trained to extract users’ preference features;
5. Surrogate model: The user preference surrogate model is constructed using Equation (12), $f (X_{u, v}) = {\overset{\land}{R}}_{D D P M} = u_{p} \cdot x + b_{u} + b_{x} + μ_{a l l}$ ;
6. Probability model: The sampling probability model is constructed using Equation (14), $G (x) = \frac{{(2 π)}^{- n / 2}}{{(\det C)}^{1 / 2}} \exp (- (x - \bar{μ})^{T} {(C)}^{- 1} (x - \bar{μ}) / 2)$ ;
7. Population update: New individuals with user preferences are generated through the probabilistic model to form a set of items to be recommended, and the fitness of individuals is estimated by the surrogate model;
8. Recommendation list: Based on the elite selection strategy, the N items with the highest predicted ratings are selected to generate a recommendation list of items that the user may be interested in;
9. Interactive evaluations: The item recommendation list is submitted to the current user for interactive evaluations. Based on this feedback, the algorithm checks if the termination criteria are met. If so, it concludes and outputs the final result; otherwise, the dominant group is updated with new data for the next iteration.
10. Model dynamic management: The accuracy of the user preference surrogate model is assessed. If its average accuracy falls below the predefined threshold, the process advances to Step 4 to track user preferences. Otherwise, it proceeds to Step 9 to conduct the iterative evolutionary search.

End Do
End

The computational complexity of the denoising diffusion model-driven adaptive estimation of distribution algorithm, integrating multi-modal data, primarily consists of five components: the vectorized representation of text comments, the vectorized representation of image features, training the user interest preference model, screening the recommended item set, and predicting items’ scores. The vectorized representation of text and images is obtained through offline computing. The computational complexity of training the user interest preference model is

O (|P_{a}| \times (n_{1} + n_{2} + n_{3}) \times m)

. The time cost of filtering the recommended item set is

O (s_{u} \times |V|)

, where

|V|

is the total number of individual items in the feasible solution space. The time consumption for predicting items’ scores is

O (s_{u})

. Consequently, the overall computational complexity of the proposed algorithm is

O (|P_{a}| \times (n_{1} + n_{2} + n_{3}) \times m + s_{u} \times |V|)

.

4. Experimental Results and Analysis

4.1. Experimental Environment

To demonstrate the comprehensive performance of the proposed algorithm, we conducted experiments on two public datasets: the Amazon dataset [42] (from Prof. Julian McAuley’s team at the University of California, San Diego) and the Yelp dataset [43]. The statistical details of each dataset are provided in Table 1.

The experiments were run on an Intel Core i5-4590 CPU at 3.30 GHz with 4 GB RAM, and the experimental platform was developed in Python 3.11. Evaluation indicators including Root Mean Square Error (RMSE), Hit Ratio (HR), mean Average Precision (mAP), and Normalized Discounted Cumulative Gain (NDCG) were used to evaluate the performance of personalized search and recommendation algorithms. RMSE measures the rating prediction ability of personalized search algorithms. HR, mAP, and NDCG measure the ability of personalized recommendation algorithms to predict users’ preferences on items for high-quality recommendations, reflecting user satisfaction and usage experience. Together, these indicators characterize the prediction accuracy and recommendation performance of personalized search and recommendation algorithms.
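For reference, the four indicators can be sketched as below. Definitions of these metrics vary slightly across the literature; this version assumes binary relevance, per-user HR@k equal to 1 when at least one relevant item appears in the top k, and AP@k normalized by min(number of relevant items, k). These conventions are assumptions and may differ from the ones used in the experiments.

```python
import math

def rmse(pred, true):
    """Root mean square error between predicted and observed ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def hit_ratio_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top-k list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def average_precision_at_k(ranked, relevant, k):
    """Average of precision values at the ranks where relevant items occur."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

mAP@k is then the mean of `average_precision_at_k` over users, and HR@k and NDCG@k are likewise averaged over users.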

4.2. Comprehensive Performance Comparison Experiments

To verify the effectiveness of the proposed algorithm in this paper, some comparative experiments are conducted with TruthSR [44], MMSSL [45], and TiM4Rec [46]. The brief introduction of these comparative algorithms is as follows:

TruthSR: Captures the consistency and complementarity of user-generated contents to reduce noise interference, and dynamically evaluates prediction credibility by combining subjective and objective perspectives to make personalized recommendations.
MMSSL: Constructs an interaction structure between the user-item collaborative view and the multi-modal semantic view through adversarial perturbation-based data augmentation, and uses cross-modal contrastive learning to capture user preferences and semantic commonalities while preserving the diversity of preferences.
TiM4Rec: Introduces a time-aware structured mask matrix that integrates time information into a state space framework, yielding a time-aware Mamba-based recommendation algorithm.

In the experiment, some evaluation metrics, such as RMSE, HR@10, mAP@10, and NDCG@10, are used to measure the comprehensive performance of those algorithms. Each algorithm was independently run 10 times. The average experimental results are shown in Table 2.

By observing the experimental results, the following conclusions are drawn:

(1): In personalized search and recommendation algorithms, the proposed DDM-AEDA algorithm has demonstrated excellent prediction accuracy and recommendation performance. Specifically, the RMSE values achieved by DDM-AEDA are predominantly superior to those of other benchmark algorithms. For instance, on the Amazon-Beauty dataset, DDM-AEDA attained an optimal RMSE of 1.120, which is 28.53% lower than that of the second-best MMSSL algorithm, 37.40% lower than TruthSR, and 30.65% lower than TiM4Rec. These comparative results underscore the strong predictive capability of the proposed method. It indicates that DDM-AEDA can effectively capture users’ dynamic preferences by leveraging the powerful modeling ability of DDPM combined with multi-step generative modeling. Furthermore, DDM-AEDA mitigates interference from multimodal noise, enabling more comprehensive modeling of complex data distributions. This leads to a more accurate alignment with user preference behaviors, thereby guiding personalized search tasks.
(2): The proposed DDM-AEDA algorithm improves the ranking of items in search results so that it better aligns with users’ interest preferences. It prioritizes items that users are likely to be interested in, placing them at the front of the recommendation list. This enhances the search and browsing experience, leading to a higher hit ratio and better average precision. For example, in experiments on the Yelp dataset, DDM-AEDA achieved the best HR@10, mAP@10, and NDCG@10 among the compared methods. Specifically, the HR@10 value of DDM-AEDA is 23.24% higher than that of TiM4Rec (the second-best method), 59.68% higher than that of TruthSR, and 50.76% higher than that of MMSSL. The mAP@10 value is 3.82% higher than that of TiM4Rec, 29.11% higher than that of TruthSR, and 18.21% higher than that of MMSSL. The NDCG@10 value is 13.15% higher than that of TruthSR, 28.13% higher than that of MMSSL, and 58.48% higher than that of TiM4Rec. Overall, DDM-AEDA integrates multi-modal information, such as users’ historical behaviors, text, tags, and images, to more accurately model user preferences and capture the characteristics of multi-modal data. Highly relevant items are therefore ranked higher, which strengthens personalized recommendation capabilities and leads to superior recommendation performance and increased user satisfaction.

To further demonstrate the comprehensive performance of the proposed DDM-AEDA algorithm in personalized search and recommendation tasks, DDM-AEDA was compared on various datasets with other IEC algorithms: multi-layer perceptron-driven IEDA (MLPIEDA), RBM-MSH-assisted IEDA (RIEDA_MsH) [47], and the enhanced interactive estimation of distribution algorithm driven by dual sparse variational autoencoders integrating user-generated content (DSVAE-IEDA) [48]. Each algorithm was independently run 10 times, and the evaluation indicators were averaged over the runs to measure comprehensive performance. The average experimental results are shown in Table 3.

By observing the above experimental results, the following conclusions are drawn:

(1): The proposed DDM-AEDA algorithm makes full use of multi-modal user-generated contents to construct both a user preference model and a surrogate model based on user preferences. It generates new evolutionary individuals that reflect user interests and estimates the fitness values of new individual items, thereby guiding the personalized search and recommendation process within an interactive evolutionary computation (IEC) framework. In personalized search experiments on various datasets, DDM-AEDA demonstrated superior prediction accuracy and recommendation performance compared to other algorithms. For example, on the Yelp dataset, DDM-AEDA achieved the best overall evaluation metrics. Specifically, the average RMSE of DDM-AEDA is 3.91% lower than that of the second-best algorithm (DSVAE-IEDA), and 27.74% and 35.75% lower than those of MLPIEDA and RIEDA-MsH, respectively. The average HR@10 of DDM-AEDA is 42.11% higher than that of the second-best algorithm (DSVAE-IEDA), and 115.22% and 68.75% higher than those of MLPIEDA and RIEDA-MsH, respectively. The average mAP@10 of DDM-AEDA is 4.92% higher than that of DSVAE-IEDA, and 36.38% and 23.93% higher than those of MLPIEDA and RIEDA-MsH, respectively. The average NDCG@10 of DDM-AEDA is 8.84% higher than that of DSVAE-IEDA, and 33.17% and 36.87% higher than those of MLPIEDA and RIEDA-MsH, respectively. Although the proposed algorithm did not achieve the best result on every evaluation metric across all datasets, it still delivered strong comprehensive search performance and recommendation quality.
(2): In the comparative experiments of various datasets, DDM-AEDA generally outperforms other algorithms, demonstrating its feasibility, effectiveness, and strong performance in prediction accuracy, search efficiency, and recommendation quality. These results indicate that the proposed algorithm successfully integrates diverse information sources through an interactive adaptive optimization strategy within the IEDA evolutionary optimization framework, enabling accurate modeling of users’ interest preferences. This approach facilitates the generation of high-quality evolutionary individuals, helping to preserve the relative ranking relationship within the dominant group. By leveraging knowledge extracted from high-performing solutions, the proposed algorithm effectively generates new items that align with users’ needs and preferences. These promising individuals are selectively retained in subsequent populations, guiding the personalized evolutionary search process and reducing the risk of convergence to local optima. Simultaneously, a surrogate model based on user preferences predicts item ratings, ensuring that preferred solutions are ranked at the top of the recommendation list. Items that match user interests are selected for rapid recommendation, enabling efficient identification of satisfactory solutions. These strategies enhance the rating prediction capability and overall recommendation performance of the personalized search and recommendation algorithm.

In summary, for personalized search and recommendation tasks, DDM-AEDA fully leverages multi-modal user-generated contents to construct a user interest preference model based on a denoising diffusion model. This approach effectively uncovers deep-seated potential user preferences and captures their evolutionary patterns, thereby improving the fitting accuracy of the user interest model. An adaptive EDA probabilistic model and a user preference-based surrogate model are established within the interactive evolutionary computation framework, enhancing the interactive personalized evolutionary search process and increasing the prediction accuracy of the user evaluation surrogate model. Furthermore, the proposed algorithm guides the direction of personalized evolutionary search using objective assessment indicators—such as user experience and feedback evaluation—enabling users to efficiently locate satisfactory solutions. It achieves an effective balance between recommendation quality and search efficiency. As a result, DDM-AEDA improves the optimization capability, search efficiency, and recommendation effectiveness of personalized search algorithms, while exhibiting strong stability and scalability. It is well-suited to meet the practical demands of personalized search and recommendation in complex multi-modal data environments.

4.3. Ablation Experiments

To evaluate the contribution of its core components, we performed ablation experiments on the proposed DDM-AEDA algorithm, focusing on the multi-modal feature fusion and adaptive distribution estimation strategy modules. Experimental results on the Amazon-Beauty dataset (Figure 3) demonstrate their importance. The model configurations excluding the visual features and the adaptive strategy are referred to as “w/o Visual” and “w/o Adaptive”, respectively.

The ablation study reveals that the removal of the visual feature input module markedly degrades the comprehensive performance of DDM-AEDA. This module is designed to enhance the richness of item representation by processing visual information (color, texture, shape, etc.), thereby enabling the user preference model to account for visual tastes. This function is particularly crucial in highly visual domains such as beauty and clothing. When ablated, the model loses the ability to discriminate between items with similar appearances—for instance, lipsticks of different shades or similar sports shoes—thus significantly reducing the recommendation hit rate and ranking quality. Therefore, the visual feature input is indispensable for achieving high recommendation accuracy and user satisfaction.

Similarly, ablating the adaptive distribution estimation strategy leads to a notable decline in overall performance. In the interactive EDA framework, DDM-AEDA employs this strategy to balance the search process by adapting the selection ratio (sr) and covariance scaling (cs) parameters based on historical information. When this adaptive mechanism is disabled, as in the “w/o Adaptive” experiment where sr and cs were fixed and the archive update was turned off, the model can no longer adjust its search direction in response to user feedback. This failure significantly undermines both the diversity of the recommendation results and the convergence speed, confirming that the adaptive strategy is essential for enhancing the search efficiency and recommendation performance of personalized search and recommendation algorithms.

4.4. Hyperparameter Sensitivity Analysis

The performance and efficiency of the proposed DDM-AEDA algorithm are highly sensitive to its hyperparameters. To assess this sensitivity, we conduct experiments focusing on two key parameters: the selection ratio (sr) and the covariance scaling (cs) parameters.

In our experiments, sr was varied across

[0.1, 0.2, 0.3, 0.4]

. cs was varied across

[0.3, 0.6, 0.9]

when sr = 0.1. cs was varied across

[0.2, 0.4, 0.6, 0.8]

when sr = 0.2. cs was varied across

[0.3, 0.6, 0.9]

when sr = 0.3. cs was varied across

[0.4, 0.8]

when sr = 0.4. We employed HR and NDCG as evaluation metrics. The results of this sensitivity analysis on the Amazon-Beauty dataset are presented in Figure 4.

The experimental results reveal that with a fixed sr, both HR and NDCG metrics initially increase and then decrease as cs grows, forming a smooth curve. This trend indicates that a moderate cs value yields the best recommendation performance. An optimal balance is thus critical: a large cs promotes diversity in early-stage search but hinders convergence to optimal solutions later, while a too-small cs, despite aiding later-stage focus, risks early convergence to local optima. Therefore, we configure the DDM-AEDA algorithm with sr = 0.2 and cs = 0.4.
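The grid explored in this analysis can be enumerated as follows. The dictionary simply restates the (sr, cs) values listed above; the helper names and the idea of selecting by a single score are illustrative assumptions, and no experimental scores are included.

```python
# cs values tested for each sr, as listed in the sensitivity analysis.
GRID = {
    0.1: [0.3, 0.6, 0.9],
    0.2: [0.2, 0.4, 0.6, 0.8],
    0.3: [0.3, 0.6, 0.9],
    0.4: [0.4, 0.8],
}

def configurations(grid):
    """Flatten the grid into a list of (sr, cs) pairs."""
    return [(sr, cs) for sr, values in grid.items() for cs in values]

def best_configuration(scores):
    """Return the (sr, cs) pair with the highest score.

    `scores` maps (sr, cs) to a metric such as HR or NDCG measured by the caller.
    """
    return max(scores, key=scores.get)

all_configs = configurations(GRID)   # 12 configurations in total
```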

5. Conclusions

This paper addresses personalized search and recommendation in complex network environments by proposing a denoising diffusion model-driven adaptive estimation of distribution algorithm integrating multi-modal data. The approach combines user interest preference modeling with surrogate-assisted IEC. Specifically, a user interest preference model is constructed using a denoising diffusion model, incorporating multi-modal user-generated contents. Within the estimation of distribution algorithm framework, a user preference-based surrogate model is established, alongside adaptive operators and strategies designed to guide the personalized evolutionary search. A dynamic model management mechanism is also introduced to track shifts in user interest preferences. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art algorithms by approximately 5–23% in HR@10, 4–5% in mAP@10, and 5–14% in NDCG@10. The algorithm is validated to offer advantages in functional completeness, prediction accuracy, and decision transparency, thereby enhancing both the personalized search experience for users and the overall performance of the recommendation system. Future work will focus on improving computational efficiency, ensuring system security, and strengthening user privacy protection to deliver intelligent, exclusive, and secure personalized services.

Author Contributions

Conceptualization, L.B. and H.Y.; methodology, L.B. and H.Y.; software, L.W. and H.Y.; validation, L.W. and H.Y.; formal analysis, L.W. and Y.P.; investigation, B.X.; resources, B.X.; data curation, Y.P.; writing—original draft preparation, L.B. and H.Y.; writing—review and editing, L.B. and L.W.; visualization, L.B. and L.W.; supervision, L.B. and B.X.; project administration, L.B.; funding acquisition, B.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Guangdong Province of China (2023B1515120020, 2024A1515012450). This work was supported by the National Natural Science Foundation of China under grant No. 61876184.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We are grateful to the Large Language Model (LLM) for its assistance in refining the language and expression of this paper. We also extend our thanks to the reviewers for their insightful comments and valuable suggestions, which have significantly enhanced the quality of our work.

Conflicts of Interest

The authors declare no conflict of interest.

References

Raza, S.; Ding, C. News recommender system: A review of recent progress, challenges, and opportunities. Artif. Intell. Rev. 2022, 55, 749–800.
Etemadi, M.; Abkenar, S.B.; Ahmadzadeh, A.; Kashani, M.H.; Asghari, P.; Akbari, M.; Mahdipour, E. A systematic review of healthcare recommender systems: Open issues, challenges, and techniques. Expert Syst. Appl. 2023, 213, 118823.
Deldjoo, Y.; He, Z.; McAuley, J.; Korikov, A.; Sanner, S.; Ramisa, A.; Vidal, R.; Sathiamoorthy, M.; Kasirzadeh, A.; Milano, S. A review of modern recommender systems using generative models (Gen-RecSys). In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), Barcelona, Spain, 25–29 August 2024; ACM: New York, NY, USA, 2024; pp. 6448–6458.
Zou, F.; Chen, D.; Xu, Q.; Jiang, Z.; Kang, J. A two-stage personalized recommendation based on multi-objective teaching–learning-based optimization with decomposition. Neurocomputing 2021, 452, 716–727.
Qi, L.; Lin, W.; Zhang, X.; Dou, W.; Xu, X.; Chen, J. A correlation graph based approach for personalized and compatible web apis recommendation in mobile app development. IEEE Trans. Knowl. Data Eng. 2022, 35, 5444–5457.
Vullam, N.; Vellela, S.S.; Reddy, V.; Rao, M.V.; SK, K.B. Multi-agent personalized recommendation system in e-commerce based on user. In Proceedings of the 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 4–6 May 2023; pp. 1194–1199.
Liang, K.; Liu, H.; Shan, M.; Zhao, J.; Li, X.; Zhou, L. Enhancing scenic recommendation and tour route personalization in tourism using UGC text mining. Appl. Intell. 2024, 54, 1063–1098.
Cai, D.; Qian, S.; Fang, Q.; Hu, J.; Ding, W.; Xu, C. Heterogeneous graph contrastive learning network for personalized micro-video recommendation. IEEE Trans. Multimed. 2022, 25, 2761–2773.
El-Kishky, A.; Markovich, T.; Park, S.; Verma, C.; Kim, B.; Eskander, R.; Malkov, Y.; Portman, F.; Samaniego, S.; Xiao, Y.; et al. Twhin: Embedding the twitter heterogeneous information network for personalized recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 2842–2850.
Hui, B.; Zhang, L.; Zhou, X.; Wen, X.; Nian, Y. Personalized recommendation system based on knowledge embedding and historical behavior. Appl. Intell. 2022, 52, 954–966.
Mu, Y.; Wu, Y. Multimodal movie recommendation system using deep learning. Mathematics 2023, 11, 895.
Linden, G.; Smith, B.; York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Comput. 2003, 7, 76–80.
Wang, J.; De Vries, A.P.; Reinders, M.J.T. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 6–11 August 2006; ACM: New York, NY, USA, 2006; pp. 501–508.
Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–21 June 2009; pp. 452–461.
Qiu, H.; Liu, Y.; Guo, G.; Sun, Z.; Zhang, J.; Nguyen, H.T. BPRH: Bayesian personalized ranking for heterogeneous implicit feedback. Inf. Sci. 2018, 453, 80–98.
Rendle, S. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol. (TIST) 2012, 3, 1–22.
Pang, G.; Wang, X.; Hao, F.; Xie, J.; Wang, X.; Lin, Y.; Qin, X. ACNN-FM: A novel recommender with attention-based convolutional neural network and factorization machines. Knowl.-Based Syst. 2019, 181, 104786.
He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182.
He, R.; McAuley, J. VBPR: Visual bayesian personalized ranking from implicit feedback. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30, pp. 144–150.
Kim, D.; Park, C.; Oh, J.; Lee, S.; Yu, H. Convolutional matrix factorization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 233–240.
Chen, J.; Zhang, H.; He, X.; Nie, L.; Liu, W.; Chua, T.S. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 335–344.
Wei, Y.; Wang, X.; Nie, L.; He, X.; Hong, R.; Chua, T.S. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1437–1445.
Wang, Z.; Chen, H.; Li, Z.; Lin, K.; Jiang, N.; Xia, F. VRConvMF: Visual recurrent convolutional matrix factorization for movie recommendation. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 6, 519–529.
Yang, M.; Zhou, P.; Li, S.; Zhang, Y.; Hu, J.; Zhang, A. Multi-head multimodal deep interest recommendation network. Knowl.-Based Syst. 2023, 276, 110689.
Deng, J.; Wang, S.; Wang, Y.; Qi, J.; Zhao, L.; Zhou, G.; Meng, G. MMBee: Live Streaming Gift-Sending Recommendations via Multi-Modal Fusion and Behaviour Expansion. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 4896–4905.
Yan, B.; Chen, S.; Jia, S.; Liu, J.; Liu, Y.; Fu, C.; Guan, W.; Zhao, H.; Zhang, X.; Zhang, K.; et al. MIM: Multi-modal Content Interest Modeling Paradigm for User Behavior Modeling. arXiv 2025, arXiv:2502.00321.
Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video diffusion models. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2022; Volume 35, pp. 8633–8646.
Croitoru, F.-A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef]
Chen, S.; Sun, P.; Song, Y.; Luo, P. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 19830–19843. [Google Scholar] [CrossRef]
Cao, H.; Tan, C.; Gao, Z.; Xu, Y.; Chen, G.; Heng, P.A.; Li, S.Z. A survey on generative diffusion models. IEEE Trans. Knowl. Data Eng. 2024, 36, 2814–2830. [Google Scholar] [CrossRef]
Chen, H.; Zhang, Y.; Cun, X.; Xia, M.; Wang, X.; Weng, C.; Shan, Y. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7310–7320. [Google Scholar] [CrossRef]
Wang, W.; Xu, Y.; Feng, F.; Lin, X.; He, X.; Chua, T.-S. Diffusion recommender model. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 832–841. [Google Scholar] [CrossRef]
Yang, Z.; Wu, J.; Wang, Z.; Wang, X.; Yuan, Y.; He, X. Generate what you prefer: Reshaping sequential recommendation via guided diffusion. Adv. Neural Inf. Process. Syst. 2023, 36, 24247–24261. [Google Scholar]
Li, Z.; Xia, L.; Huang, C. Recdiff: Diffusion model for social recommendation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 1346–1355. [Google Scholar] [CrossRef]
Wei, T.R.; Fang, Y. Diffusion Models in Recommendation Systems: A Survey. arXiv 2025, arXiv:2501.10548. [Google Scholar] [CrossRef]
Li, Z.; Sun, A.; Li, C. Diffurec: A diffusion model for sequential recommendation. ACM Trans. Inf. Syst. 2023, 42, 1–28. [Google Scholar] [CrossRef]
Zhao, J.; Wang, W.; Xu, Y.; Sun, T.; Feng, F.; Chua, T.-S. Denoising diffusion recommender model. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), Washington, DC, USA, 14–18 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1370–1379.
Jiang, Y.; Yang, Y.; Xia, L.; Huang, C. DiffKG: Knowledge graph diffusion model for recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM ’24), Merida, Mexico, 4–8 March 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 313–321.
Cui, Z.; Wu, H.; He, B.; Cheng, J.; Ma, C. Context Matters: Enhancing Sequential Recommendation with Context-aware Diffusion-based Contrastive Learning. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 404–414.
Xia, R.; Cheng, Y.; Tang, Y.; Liu, X.; Liu, X.; Wang, L.; Jiang, P. S-diff: An anisotropic diffusion model for collaborative filtering in spectral domain. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM ’25), Hannover, Germany, 10–14 March 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 70–78.
Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for TensorFlow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 63–72.
Yue, Z.; Wang, Y.; He, Z.; Zeng, H.; McAuley, J.; Wang, D. Linear recurrent units for sequential recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida, Mexico, 4–8 March 2024; pp. 930–938.
Wu, J.; Wang, X.; Feng, F.; He, X.; Chen, L.; Lian, J.; Xie, X. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021; pp. 726–735.
Yan, M.; Huang, H.; Liu, Y.; Zhao, J.; Gao, X.; Xu, C.; Guan, Z.; Zhao, W. TruthSR: Trustworthy sequential recommender systems via user-generated multimodal content. In Proceedings of the International Conference on Database Systems for Advanced Applications, Gifu, Japan, 2–5 July 2024; Springer Nature: Singapore, 2024; pp. 180–195.
Wei, W.; Huang, C.; Xia, L.; Zhang, C. Multi-modal self-supervised learning for recommendation. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 790–800.
Fan, H.; Zhu, M.; Hu, Y.; Feng, H.; He, Z.; Liu, H.; Liu, Q. TiM4Rec: An efficient sequential recommendation model based on time-aware structured state space duality model. Neurocomputing 2025, 654, 131270.
Bao, L.; Sun, X.; Gong, D.; Zhang, Y. Multisource heterogeneous user-generated contents-driven interactive estimation of distribution algorithms for personalized search. IEEE Trans. Evol. Comput. 2021, 26, 844–858.
Yang, H.; Bao, L.; Sun, X.; Peng, Y. Dual Sparse Variational Autoencoder-driven Interactive Estimation of Distribution Algorithm with User Generated Contents. In Proceedings of the 2024 International Conference on New Trends in Computational Intelligence (NTCI), Qingdao, China, 18–20 October 2024; pp. 172–177.

Figure 1. Algorithmic framework.

Figure 2. User interest preference model based on denoising diffusion model.

Figure 3. Results of the ablation experiments.

Figure 4. Hyperparameter sensitivity analysis on Amazon-Beauty.

Table 1. Statistical information of the datasets.

Dataset	#Users	#Items	#Interactions	Sparsity
Amazon-Beauty	22,363	12,101	198,502	99.927%
Amazon-Sports	35,598	18,357	296,337	99.955%
Yelp	31,668	38,048	1,561,406	99.870%
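
The Sparsity column of Table 1 can be reproduced from the other three columns. The following is a minimal sketch, assuming the usual definition of sparsity as the share of empty cells in the user-item interaction matrix; the function name `sparsity_percent` and the `DATASETS` dictionary are illustrative and do not come from the article.

```python
# Dataset statistics copied from Table 1: (users, items, interactions).
DATASETS = {
    "Amazon-Beauty": (22_363, 12_101, 198_502),
    "Amazon-Sports": (35_598, 18_357, 296_337),
    "Yelp": (31_668, 38_048, 1_561_406),
}


def sparsity_percent(n_users: int, n_items: int, n_interactions: int) -> float:
    """Percentage of empty cells in the user-item interaction matrix.

    Assumed definition: 100 * (1 - interactions / (users * items)).
    """
    density = n_interactions / (n_users * n_items)
    return 100.0 * (1.0 - density)


if __name__ == "__main__":
    for name, stats in DATASETS.items():
        print(f"{name}: {sparsity_percent(*stats):.3f}%")
```

Under this assumed definition, the values rounded to three decimals agree with the Sparsity column for all three datasets.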

Table 2. Comparative experimental results. Improve-x is the relative change of DDM-AEDA with respect to the best-performing comparison algorithm; for RMSE, where lower is better, a negative value indicates an improvement.

Dataset	Metric	TruthSR	MMSSL	TiM4Rec	DDM-AEDA	Improve-x
Amazon-Beauty	RMSE	1.789	1.567	1.615	1.120	−28.53%
	HR@10	0.0652	0.0709	0.0838	0.0878	4.77%
	mAP@10	0.708	0.765	0.812	0.843	3.82%
	NDCG@10	0.0479	0.0487	0.0446	0.0554	13.76%
Amazon-Sports	RMSE	1.533	1.489	1.388	1.477	6.41%
	HR@10	0.0678	0.0689	0.0587	0.0786	14.08%
	mAP@10	0.627	0.731	0.822	0.864	5.11%
	NDCG@10	0.0342	0.0371	0.0401	0.0423	5.49%
Yelp	RMSE	1.654	1.358	1.874	1.107	−18.48%
	HR@10	0.0186	0.0197	0.0241	0.0297	23.24%
	mAP@10	0.694	0.758	0.863	0.896	3.82%
	NDCG@10	0.0479	0.0423	0.0342	0.0542	13.15%
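
The Improve-x column of Table 2 can be recomputed from the other columns. The sketch below rests on an assumption inferred from the numbers rather than stated in the caption: the reference value is the best score among the three comparison algorithms (lowest for RMSE, highest otherwise), and Improve-x is the signed relative change of DDM-AEDA against it. The helper names are hypothetical.

```python
def best_baseline(scores, lower_is_better):
    """Pick the strongest comparison score for the given metric direction."""
    return min(scores) if lower_is_better else max(scores)


def relative_change_percent(ours, baseline_scores, lower_is_better):
    """Signed relative change of `ours` against the best baseline, in percent."""
    ref = best_baseline(baseline_scores, lower_is_better)
    return 100.0 * (ours - ref) / ref


# Rows copied from Table 2: baselines are (TruthSR, MMSSL, TiM4Rec).
ROWS = [
    ("Amazon-Beauty", "RMSE", [1.789, 1.567, 1.615], 1.120, True),
    ("Amazon-Beauty", "HR@10", [0.0652, 0.0709, 0.0838], 0.0878, False),
    ("Amazon-Beauty", "mAP@10", [0.708, 0.765, 0.812], 0.843, False),
    ("Amazon-Beauty", "NDCG@10", [0.0479, 0.0487, 0.0446], 0.0554, False),
    ("Amazon-Sports", "RMSE", [1.533, 1.489, 1.388], 1.477, True),
]

if __name__ == "__main__":
    for dataset, metric, baselines, ours, lower in ROWS:
        change = relative_change_percent(ours, baselines, lower)
        print(f"{dataset} {metric}: {change:+.2f}%")
```

For the Amazon-Sports RMSE row the change is positive because TiM4Rec (1.388) attains a lower error than DDM-AEDA (1.477), which is the one entry in Table 2 where DDM-AEDA is not the best method.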

Table 3. Comparative experimental results of DDM-AEDA and other IEC algorithms. Improve-y is the relative change of DDM-AEDA with respect to the best-performing comparison algorithm; for RMSE, where lower is better, a negative value indicates an improvement.

Dataset	Metric	MLPIEDA	RIEDA-MsH	DSVAEIEDA	DDM-AEDA	Improve-y
Amazon-Beauty	RMSE	2.200	1.584	1.261	1.120	−11.18%
	HR@10	0.0603	0.0750	0.0781	0.0878	12.42%
	mAP@10	0.650	0.763	0.860	0.843	−1.98%
	NDCG@10	0.0398	0.0330	0.0458	0.0554	20.96%
Amazon-Sports	RMSE	1.585	1.394	1.751	1.477	5.95%
	HR@10	0.0705	0.0569	0.0766	0.0786	2.61%
	mAP@10	0.669	0.711	0.894	0.864	−3.36%
	NDCG@10	0.0361	0.0356	0.0388	0.0423	9.02%
Yelp	RMSE	1.532	1.723	1.152	1.107	−3.91%
	HR@10	0.0138	0.0176	0.0209	0.0297	42.11%
	mAP@10	0.657	0.723	0.854	0.896	4.92%
	NDCG@10	0.0407	0.0396	0.0498	0.0542	8.84%
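
For readers unfamiliar with the metrics reported in Tables 2 and 3, the following is a minimal sketch of RMSE, HR@K and NDCG@K. It assumes a leave-one-out setting with a single held-out target item per user and the standard textbook definitions; the article's exact evaluation protocol is not given in this section, so the function names and the setting are assumptions, not the authors' implementation.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def hit_at_k(ranked_items, target, k=10):
    """1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with one relevant item: 1 / log2(rank + 1), or 0 if missed."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position in the ranking
    return 1.0 / math.log2(rank + 1)


if __name__ == "__main__":
    ranking = ["item_a", "item_b", "item_c"]  # hypothetical recommendation list
    print(hit_at_k(ranking, "item_b"))
    print(ndcg_at_k(ranking, "item_b"))
    print(rmse([3.0, 4.0], [3.0, 2.0]))
```

Dataset-level HR@K and NDCG@K are then the averages of these per-user values.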

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
