1. Introduction
With the rapid development of information technology, the internet, big data, and other technologies, network information is growing explosively. It is almost impossible to control the generation and dissemination of information in the digital age. The sources, types, and forms of information have become diverse and differentiated. At the same time, the growth in the number of users and their generated data has led to the emergence of new ways of connection, collaboration, and sharing in the network. The scale and complexity of data have reached an unprecedented level, giving rise to the problem of information overload [1,2,3]. In numerous fields such as e-commerce, social media, news, and video platforms, a vast amount of product information is constantly emerging. The information available to users far exceeds their processing capacity, making it difficult for them to find the content that they are interested in or need. Application platforms aim to help users find items of interest within massive amounts of information, thereby enhancing users’ experiences and improving commercial benefits. Personalized search and recommendation systems have emerged in response, attracting extensive attention from researchers and practitioners [1,4,5,6,7]. Personalized search and recommendation methods combine multi-source and multi-modal information, including text, image, audio, and video, with users’ preferences. They mine feature information and latent correlations to predict users’ potential intentions or interest preferences in the decision-making process for personalized recommendation services. This topic has become a research hotspot in the field of artificial intelligence [8,9,10,11].
With the development of machine learning and deep learning, many personalized recommendation methods have been proposed, such as collaborative filtering (CF) [12,13], Bayesian personalized ranking (BPR) [14,15], matrix factorization (MF) [16,17], and neural-CF [18], and they achieve good recommendation performance. However, explicit feedback imposes a higher decision-making cost on users, who need to think carefully before providing quantifiable feedback. Therefore, recommendation systems mine implicit feedback, such as clicking, liking, following, purchasing, and other behaviors, to predict users’ interest preferences for personalized recommendation tasks. Although these recommendation methods have achieved remarkable results, they still face severe challenges of data sparsity, cold start, and dynamic preference modeling. Deep generative models can generate new samples by learning the latent distribution of data, even when the data are incomplete or noisy. They possess powerful feature extraction and data modeling capabilities. They have gradually become an important method for handling complex multi-source heterogeneous data and have been successfully applied in tasks such as image generation and text generation. However, traditional generative models struggle to integrate multi-modal data and to extract feature information effectively enough for practical applications. In recent years, the diffusion model has emerged as a new generative model. It progressively adds noise to data and learns to reverse this corruption through iterative denoising, which supports the feature extraction of multi-source heterogeneous data and information fusion across data modalities. Applying a diffusion model to the recommendation system brings new opportunities to the research fields of personalized search and recommendation.
The needs and interests of different users vary for the same task (such as purchasing books, searching for movies, etc.), and those of the same user at different times may be different in practical application scenarios. In addition, users’ knowledge experiences, potential needs, and behavioral motivations may undergo dynamic changes with the influence of environment, time, and information. When dealing with users’ dynamic personalized demands, the coarse-grained modeling of users’ preferences limits the improvement of personalized recommendation methods. Due to the difficulty in the quantitative representation and dynamic change of users’ preferences, it is necessary to design an appropriate personalized search and recommendation framework. It models users’ interest preferences and provides feedback based on users’ information, which can accurately capture feature information to predict users’ potential preferences and dynamic trends. This further guides the model to adaptively optimize and adjust the dissemination mechanism in the iterative searching process for personalized recommendation tasks. It will improve the search efficiency, recommendation effect, and robustness of personalized search and recommendation algorithms. However, personalized search and recommendation methods still face the enormous challenges of user interaction sparsity, cold start, and long-tail recommendation. It is difficult to accurately capture users’ potential intentions and interest preferences, resulting in poor modeling of dynamic user interest preferences.
To address the aforementioned issues, this paper proposes a denoising diffusion model-driven adaptive estimation of distribution algorithm integrating multi-modal data. User-generated contents are extensively collected to explore the multi-modal information that carries users’ interest preferences. Multi-source multi-modal data are effectively integrated by the diffusion model. A user interest preference model based on a denoising diffusion model is established to extract the deep-seated interest preference features of users and the development pattern of user interests. In the framework of the estimation of distribution algorithm (EDA), a surrogate model based on user preferences and adaptive estimation of distribution strategies are designed in multi-modal data fusion mode to simulate users’ cognitive experiences and behavioral patterns and to guide the direction of the evolutionary optimization search. Meanwhile, a dynamic model management mechanism is presented to update the user interest preference model and related models so that users’ interest preferences are tracked in a timely manner. The algorithm helps users pick out items that match their interest preferences from a vast amount of information for personalized search and recommendation tasks. Its feasibility, effectiveness, and superiority are verified through extensive experiments on multi-domain public datasets. It enhances the global exploration and local exploitation capabilities of the personalized search algorithm, which improves users’ experience and satisfaction on recommendation system platforms, and it has good scalability and adaptability.
The contributions of this paper mainly include three aspects. (1) For dynamic personalized search and recommendation tasks, a user interest preference model based on the denoising diffusion model is constructed by considering multi-modal information fusion and cross-modal alignment representation. It can understand multi-modal information with users’ interests to obtain the preference features of the users. (2) The adaptive estimation of distribution strategies based on the user interest preference model is designed in the framework of the estimation of distribution algorithm. It refines users’ intention representation and interest tendency from a micro perspective to generate new individuals with the user preference to fit the dynamic change in users’ interest preferences. (3) A denoising diffusion model-driven user preference surrogate model is established to estimate the fitness of individuals to track users’ interest preferences for guiding the forward direction of the personalized evolutionary search. It helps efficiently complete personalized search and recommendation tasks.
The remainder of this study is organized as follows. Section 2 introduces the notations of our study and related work. In Section 3, the proposed algorithm is described in detail. Section 4 presents comparative experiment results and corresponding analysis. Finally, the conclusion is presented.
2. Related Work
2.1. Mathematical Description for Personalized Search Problems with User-Generated Contents
Personalized search with user-generated contents (UGCs) involves retrieving optimization targets that align with users’ potential needs and personalized interest preferences from a dynamically evolving search space of massive multi-source heterogeneous data. This process ultimately generates personalized item lists—comprising products or solutions—tailored to each user. At its core, this task represents a complex and dynamic optimization problem with qualitative objectives. In the process of personalized search, users evaluate and make decisions regarding retrieved items based on their own cognitive experiences and interest preferences. However, users’ cognitive experience and interest preferences are often diverse, ambiguous, uncertain, and continually evolving. As a result, the definition of users’ satisfactory solutions is highly subjective and varies significantly among individuals. Consequently, both the search outcomes and recommendation effectiveness are ultimately determined by users’ subjective judgments. Here, the objective function
$f(u, i)$ for the personalized search problem with UGCs can be defined as follows:

$$\max_{i \in I} f(u, i), \quad u \in U,$$

where $U$ is the user set and $I$ is the set of items (the feasible solution space), which is usually large and sparse. The preference of the current user $u$ on item $i$ is expressed as a model function $\hat{f}(u, i; \theta)$ with learnable parameters $\theta$.
A personalized search provides users with a list of recommended items from the feasible solution space, which comprises items of higher value that are likely to align with their interests. Through the presentation of these relevant items, the system completes the search and recommendation tasks, thereby stimulating user exploration and improving overall experience and satisfaction.
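As a minimal sketch of this formulation, the snippet below ranks a small feasible item set by a preference function and returns the best entries. The function name `recommend_top_n`, the toy score table, and the item identifiers are illustrative assumptions and do not come from the paper; a learned preference model would take the place of the lookup.

```python
from typing import Callable, Hashable, Iterable, List


def recommend_top_n(
    user: Hashable,
    items: Iterable[Hashable],
    preference: Callable[[Hashable, Hashable], float],
    n: int,
) -> List[Hashable]:
    """Rank the feasible items by modeled preference and keep the best n."""
    ranked = sorted(items, key=lambda item: preference(user, item), reverse=True)
    return ranked[:n]


# Hypothetical preference table standing in for a learned model.
toy_scores = {"book_a": 4.5, "book_b": 2.0, "book_c": 3.8, "book_d": 4.9}
top = recommend_top_n("user_1", toy_scores, lambda u, i: toy_scores[i], n=2)
```

In practice the feasible solution space is far too large to score exhaustively, which is why the search is guided by a surrogate model and an evolutionary sampling strategy rather than by a full sort.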
2.2. Recommendation Algorithms Integrating Multi-Modal Data
Various extractable and analyzable multi-modal data, including textual, visual, and auditory information, are frequently utilized in recommendation systems as important supplementary information to users’ interaction behaviors. These data enrich the representation features of users and items, thereby partially alleviating data sparsity and the cold-start problem. The integration of these multi-modal data with the collaborative information is critical. It enables the effective fusion of single-modal representations and multi-modal features, thus maintaining the integrity and diversity of the combined information. These comprehensive and accurate extracted features of user preferences and item representations are fundamental to recommendation algorithms that integrate multi-modal data.
In early research, He et al. [19] extracted items’ image features through convolutional neural networks (CNNs) and used matrix factorization to predict users’ preferences. The resulting visual Bayesian personalized ranking (VBPR) model uses Bayesian personalized ranking to alleviate the cold start problem. Kim et al. [20] combined a CNN and probabilistic matrix factorization (PMF) to capture the context information of documents, presenting a convolutional matrix factorization model (ConvMF) that improves recommendation accuracy. Chen et al. [21] utilized the ResNet-152 model to extract modal information from images and videos and proposed an attentive collaborative filtering (ACF) method. Wei et al. [22] utilized multi-modal information of vision, audio, and text to construct a user-short video bipartite graph. They proposed a multi-modal graph convolution network (MMGCN) that exploits the topological structure of neighboring nodes to enrich the representation of each node. It can learn high-order features from the user-short video bipartite graph to improve the recommendation performance. Wang et al. [23] used a pre-trained word embedding model to represent text features and utilized convolutional neural networks to obtain different single-level visual features from different pooling layers. They proposed a movie recommendation system based on visual recurrent convolutional matrix factorization (VRConvMF) to improve recommendation accuracy. Yang et al. [24] designed a multi-modal module, an attention module, and a multi-head residual network module to extract the image features of the video cover and expand the learned feature set. The resulting multi-head multi-modal deep interest network (MMDIN) enhances representational ability as well as predictive and recommendation performance. Deng et al. [25] proposed a recommendation model based on multi-modal fusion and behavior expansion. A learning-query multi-modal fusion module is designed to perceive the dynamic content of flow fragments and handle complex multi-modal interactions. A graph-guided interest expansion method is presented to learn the representations of users and information flows in a large-scale graph with multi-modal attributes. Yan et al. [26] presented a pre-trained model that can extract high-quality multi-modal embedding representations. A content interest-aware supervised fine-tuning stage guides the alignment of user preference embedding representations through users’ behavior signals to bridge the semantic gap between contents and user interests. They proposed a multi-modal content interest modeling paradigm for user behavior modeling that integrates multi-modal embeddings and ID-based collaborative filtering into a unified framework.
The above-mentioned methods take into account multi-modal information, such as images, videos, text, contexts, and sounds, to conduct the feature extraction and information fusion of multi-modal data from multiple aspects. This has improved the performance of the recommendation system to a certain extent. However, the existing methods do not fully consider the correlations, differences, and dynamics among different modalities, resulting in incomplete multi-modal information mining and insufficient cross-modal alignment representation. It will affect the ability to model users’ interest preferences and lead to an insufficient understanding of users’ potential needs and deep interest preferences. Meanwhile, users’ preference information is different in various types of data, and its contribution to establishing a user interest preference model varies when conducting multi-modal information fusion. If an appropriate information fusion method is not used, a large amount of noise will be introduced, and there will be deficiencies in feature fusion and semantic association modeling. It will affect the accuracy and effectiveness of the recommendation system.
2.3. Diffusion Models
In recent years, diffusion models have achieved significant breakthroughs in domains including computer vision and natural language processing, driven by their strong capabilities in data generation, representation learning, and sequence modeling. Consequently, they have become a research hotspot [27,28,29,30,31]. Technically, a diffusion model is a deep generative method whose core concept comprises a forward diffusion process that progressively adds Gaussian noise to the data and a reverse process that learns to reconstruct the data through iterative denoising. This methodology has been introduced into recommender systems, providing a novel generative modeling paradigm that can effectively alleviate data sparsity and support recommendation that integrates multi-modal information [32,33,34,35].
Li et al. [36] proposed a sequential recommendation method based on a diffusion model. Items are represented as distributions that adaptively reflect users’ multiple interests and the multiple aspects of items. The target item embedding is corrupted toward a Gaussian distribution by adding noise, which serves both distribution representation generation and uncertainty injection for the item sequence. Based on users’ historical interactions, Gaussian noise is then reverse-transformed into the representation of target items. Zhao et al. [37] presented a denoising diffusion recommendation model. It utilizes the multi-step denoising process of the diffusion model, injecting controllable Gaussian noise in the forward process and iteratively removing noise in the reverse denoising process, to make the embedding representations of users and items more robust. Jiang et al. [38] proposed a knowledge graph diffusion model for recommendation. By integrating generative diffusion models with data augmentation paradigms, it achieves robust knowledge graph representation learning and promotes the collaboration between knowledge-aware item semantics and collaborative relationship modeling. A collaborative knowledge graph convolution mechanism with collaborative signals reflecting user-item interaction patterns was introduced to guide the knowledge graph diffusion process. Cui et al. [39] utilized context information to generate reasonable augmented views and proposed an enhanced sequential recommendation method with context-aware diffusion-based contrastive learning. Xia et al. [40] proposed an anisotropic diffusion model for collaborative filtering in the spectral domain. It maps the user interaction vector to the spectral domain and parameterizes the diffusion noise to align with graph frequency. This anisotropic diffusion retains significant low-frequency components to maintain a high signal-to-noise ratio. Further, a conditional denoising network is adopted to encode users’ interactions and restore the true preferences from noisy data. However, diffusion models are confronted with challenges such as high computational demands and a lack of interpretability. Consequently, future research on diffusion-based recommendation algorithms, through continued model optimization and cross-domain integration, is expected to concentrate on directions such as efficient inference and causal modeling. This focus is instrumental in facilitating the transition of these models from research to practical application in real-world business scenarios.
3. Denoising Diffusion Model-Driven Adaptive Estimation of Distribution Algorithm Integrating Multi-Modal Data
3.1. The Proposed Algorithm Framework
The framework of the proposed denoising diffusion model-driven adaptive estimation of distribution algorithm integrating multi-modal data (DDM-AEDA) is shown in Figure 1.
The proposed algorithm mainly consists of four parts: (1) multi-modal data processing, including data acquisition and vectorized representation; (2) a user interest preference model based on the denoising diffusion model; (3) a surrogate model-driven adaptive estimation of distribution algorithm; (4) a model management mechanism for dynamically tracking the evolution of users’ interest preferences.
3.2. Multi-Modal Data Processing and Its Fusion Learning Representation
A substantial volume of multi-modal user-generated content is widely collected in network environments. These data encompass historical user interactions (such as ratings and textual comments), item content information (including category tags, descriptions, and images), and social network relationships. These diverse sources contain numerous explicit and implicit user preferences. The proposed method fully utilizes the above-mentioned knowledge, which reflects user interests from different perspectives, to alleviate data sparsity in big data environments, thereby achieving a comprehensive performance improvement for personalized search and recommendation algorithms.
This section details the preprocessing and feature representation techniques applied to multi-modal user-generated contents.
- (1) Users’ ratings: Users’ ratings on items are represented as a user rating matrix $R = [r_{ui}]$, where $r_{ui}$ represents the rating of user $u$ for item $i$. The larger the value $r_{ui}$, the more the user $u$ likes the item $i$. Users’ ratings explicitly express the degree of users’ preferences for items.
- (2) Items’ category tags: Items’ category tags briefly describe specific contents or feature information. To a certain extent, they reflect users’ interest preferences. Here, multi-hot encoding is adopted. Based on the discrete values of the limited category tags, the category tags of the item individual $i$ are vectorized as $\mathbf{c}_i = [c_{i1}, c_{i2}, \ldots, c_{in}]$, where $c_{ij} \in \{0, 1\}$ is the $j$th category tag of item $i$ and $n$ is the total number of category tags of all items. If $c_{ij} = 1$, it indicates that the item contains the $j$th category tag; otherwise, it indicates that the item does not contain the $j$th category tag.
- (3) Text comments: Text comments contain a large amount of users’ implicit preference information. Users express their latent needs and interest preferences through emotional tendencies and semantic information in text comments. Users’ text comments are collected for natural language preprocessing and text vectorization. An unsupervised Doc2Vec model is trained on a corpus constructed from the dataset, resulting in a feature representation model that encodes the latent semantic information of users’ textual comments.
By considering the interrelationships among context, word order, and semantics, the Doc2Vec model distills high-dimensional sparse word vectors into low-dimensional dense feature vectors. It thereby learns fixed-length feature representations from variable-length text content, such as sentences, paragraphs, or documents. A text comment vectorized representation matrix indexed by items’ IDs is generated, denoted as $T = [\mathbf{t}_1, \mathbf{t}_2, \ldots]$, where $\mathbf{t}_i \in \mathbb{R}^{n_t}$ represents the text comment vectorized representation of the item $i$ and $n_t$ is its length.
- (4) Social network relationships: Social network relationships express the friendship or similarity between users. Usually, neighboring users have similar interests or hobbies, so these social network relationships imply a large amount of users’ preference information. Here, Pearson correlation is adopted to calculate the Pearson similarity coefficient between users. The personalized recommendation algorithm leverages data from neighboring users to help infer items or content that a given user might prefer.
- (5) Image information: A pre-trained ResNet model [41] is utilized to extract high-dimensional visual feature vectors $\mathbf{v}_i$ of items’ images. The vectorized representation of the image of item $i$ is expressed as $\mathbf{v}_i = [v_{i1}, v_{i2}, \ldots, v_{i n_v}]$, where $n_v$ is the length of the vectorized representation of the image.
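- (6) Multi-modal information fusion and cross-modal alignment representation: Through fully connected layers, the vector representations of items’ tags $\mathbf{c}_i$, comments $\mathbf{t}_i$, and images $\mathbf{v}_i$ are mapped to a shared low-dimensional space to obtain their embedded representations $\mathbf{e}_i^{c}$, $\mathbf{e}_i^{t}$, and $\mathbf{e}_i^{v}$:

$$\mathbf{e}_i^{c} = W_c \mathbf{c}_i + \mathbf{b}_c, \quad \mathbf{e}_i^{t} = W_t \mathbf{t}_i + \mathbf{b}_t, \quad \mathbf{e}_i^{v} = W_v \mathbf{v}_i + \mathbf{b}_v,$$

where $W_c$, $W_t$, and $W_v$ are, respectively, the weights of the embedded representations of item tags, comments, and images; $\mathbf{b}_c$, $\mathbf{b}_t$, and $\mathbf{b}_v$ are, respectively, the corresponding biases.
These embedded representations are concatenated into a consistent learning representation $\mathbf{e}_i$:

$$\mathbf{e}_i = [\mathbf{e}_i^{c};\ \mathbf{e}_i^{t};\ \mathbf{e}_i^{v}].$$

The above process achieves the fusion representation of multi-modal data information to obtain the genotype representation of the item individual $i$.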
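The tag encoding in item (2) and the fusion in item (6) can be sketched as follows. This is a minimal illustration under stated assumptions: the layer weights are hand-picked rather than learned, no activation function is applied, and the text and image vectors are placeholders for Doc2Vec and ResNet outputs. The helper names `multi_hot`, `linear`, and `fuse` are hypothetical.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias vector)


def multi_hot(tags: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Return a 0/1 vector with a 1 at the position of every tag the item carries."""
    tag_set = set(tags)
    return [1.0 if tag in tag_set else 0.0 for tag in vocabulary]


def linear(weights: List[List[float]], bias: List[float], x: Sequence[float]) -> List[float]:
    """Fully connected layer without activation: W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fuse(modalities: Sequence[Sequence[float]], layers: Sequence[Layer]) -> List[float]:
    """Project each modality into the shared space and concatenate the results."""
    fused: List[float] = []
    for vec, (weights, bias) in zip(modalities, layers):
        fused.extend(linear(weights, bias, vec))
    return fused


vocabulary = ["action", "comedy", "drama"]
tag_vec = multi_hot(["drama", "action"], vocabulary)
text_vec = [0.5, -0.5]   # placeholder for a Doc2Vec comment vector
image_vec = [1.0, 2.0]   # placeholder for a ResNet image feature vector
layers = [
    ([[1.0, 0.0, 1.0]], [0.0]),  # tags   -> 1-dimensional embedding
    ([[2.0, 2.0]], [0.5]),       # text   -> 1-dimensional embedding
    ([[0.5, 0.5]], [0.0]),       # images -> 1-dimensional embedding
]
genotype = fuse([tag_vec, text_vec, image_vec], layers)
```

Each modality keeps its own projection so that vectors of different lengths can share one embedding space before concatenation.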
3.3. User Interest Preference Model Based on Denoising Diffusion Model
The collection of items that users like is screened to establish a dominant population $D$ containing users’ positive preference information. A training dataset $\mathcal{D}_{\text{train}}$ is formed by combining the genotype representations of the individual items in $D$. A user interest preference model based on the denoising diffusion model is constructed and is shown in Figure 2.
The training dataset is fed into the user interest preference model based on the denoising diffusion model. The training process mainly comprises two key stages: a forward process and a reverse process. In the forward process, the original data are progressively corrupted through the gradual addition of Gaussian noise. $\mathbf{x}_0$ represents the original user behaviors. The forward process simulates a Markov chain to obtain a series of intermediate states $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$, where $k$ is the number of time steps, and the distribution of each step is a conditional Gaussian distribution. The forward process is designed to operationalize the concept of data corruption. The addition of Gaussian noise directly models the effects of implicit feedback noise, such as accidental clicks or non-purposeful browsing. Given the current state $\mathbf{x}_{t-1}$, the next state $\mathbf{x}_t$ is obtained:

$$\mathbf{x}_t = \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

where $\beta_t$ is the noise variance, controlling the noise intensity at each step; $\boldsymbol{\epsilon}$ is the noise of standard normal distribution.
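The forward process can be sketched as below, assuming the standard variance-preserving update in which each state is scaled by the square root of one minus the noise variance before noise is added. The linear variance schedule, the number of steps, and the function names are illustrative assumptions rather than settings reported in the paper.

```python
import math
import random
from typing import List, Sequence


def forward_step(x_prev: Sequence[float], beta: float, rng: random.Random) -> List[float]:
    """One forward diffusion step: shrink the state and add Gaussian noise of variance beta."""
    keep = math.sqrt(1.0 - beta)
    spread = math.sqrt(beta)
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x_prev]


def forward_chain(x0: Sequence[float], betas: Sequence[float], rng: random.Random) -> List[List[float]]:
    """Return the Markov chain of states, starting with the clean input."""
    states = [list(x0)]
    for beta in betas:
        states.append(forward_step(states[-1], beta, rng))
    return states


rng = random.Random(0)
# Assumed linear schedule from 1e-4 to 0.02 over 100 steps.
betas = [1e-4 + (0.02 - 1e-4) * t / 99 for t in range(100)]
states = forward_chain([1.0, -0.5, 0.25], betas, rng)
```

With a noise variance of zero the step returns its input unchanged, which makes the role of the variance as a per-step corruption strength easy to see.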
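The reverse process is also modeled as a Markov process, starting from the noise sample $\mathbf{x}_k$ to recover $\mathbf{x}_0$ by gradually removing the noise. The denoising distribution at each step is:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big),$$

where $\boldsymbol{\mu}_\theta$ and $\boldsymbol{\Sigma}_\theta$ are the parameters learned through the diffusion model.
The reverse process $p_\theta(\mathbf{x}_{0:k})$ aims to reverse the noise addition in the forward process as much as possible to recover the original data $\mathbf{x}_0$:

$$p_\theta(\mathbf{x}_{0:k}) = p(\mathbf{x}_k) \prod_{t=1}^{k} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t).$$

The true distribution $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is derived from the forward process, representing the transfer of $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$. Given $\mathbf{x}_t$ and $\mathbf{x}_0$, the true distribution $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is a Gaussian distribution:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t \mathbf{I}\big).$$

The negative log-likelihood of the original data is minimized by adjusting the parameters $\theta$. During the training process, the variational lower bound $L_{\text{vlb}}$ is used for optimization:

$$L_{\text{vlb}} = \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big(q(\mathbf{x}_k \mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_k)\big) + \sum_{t=2}^{k} D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big) - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \Big],$$

where $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence, which is used to measure the difference between the true distribution $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and the generated distribution $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$.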
Through this procedure, the user interest preference model based on the denoising diffusion model learns to recover the underlying data distribution by denoising corrupted inputs, progressively refining its output toward the true distribution of users’ preferences during optimization.
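To make the target of the reverse process concrete, the sketch below computes the mean and variance of the true one-step posterior, conditioned on the clean data and the current noisy state. It uses the standard closed form for Gaussian denoising diffusion models, which is assumed here to match the model in this section. The function names and the two-step schedule are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def alpha_bars(betas: Sequence[float]) -> List[float]:
    """Cumulative products of (1 - beta), one entry per diffusion step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def posterior(
    x0: Sequence[float], xt: Sequence[float], t: int, betas: Sequence[float]
) -> Tuple[List[float], float]:
    """Mean and variance of the posterior over the previous state, for a 1-indexed step t >= 2."""
    abar = alpha_bars(betas)
    beta_t, abar_t, abar_prev = betas[t - 1], abar[t - 1], abar[t - 2]
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = [coef_x0 * a + coef_xt * b for a, b in zip(x0, xt)]
    variance = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean, variance


betas = [0.1, 0.2]
# Noise-free state at step 2 for x0 = 1.0, used only as a sanity check.
mean, variance = posterior([1.0], [math.sqrt(0.9 * 0.8)], 2, betas)
```

The Kullback–Leibler terms in the training objective compare this posterior with the learned denoising distribution, so a network that predicts the posterior mean well drives those terms toward zero.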
3.4. Surrogate Model-Driven Adaptive Estimation of Distribution Algorithm
This approach employs matrix factorization to derive latent representations of users and items, which are used to build a surrogate model for evaluating the fitness of individuals. It then incorporates an Estimation of Distribution Algorithm (EDA) to develop a probabilistic sampling model with an adaptive strategy. This mechanism boosts data utility, thereby guiding the personalized evolutionary search process for the personalized recommendation algorithm.
The learning representation $\mathbf{e}_i$ of the item individual $i$ is fed into the trained user interest preference model based on the denoising diffusion model to obtain the implicit representation $\mathbf{p}_u$ of users.
According to the matrix factorization model, the predicted preference value $\hat{r}_{ui}$ of the current user on items is estimated by using the implicit representation $\mathbf{p}_u$ of the current user $u$ and the learning representation $\mathbf{e}_i$ of the item individual:

$$\hat{r}_{ui} = \bar{r} + b_u + b_i + \mathbf{p}_u^{\top} \mathbf{e}_i,$$

where $b_u$ and $b_i$ are, respectively, the bias of users and items; $\bar{r}$ is the average rating of all samples.
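A minimal sketch of this prediction, assuming the usual biased matrix factorization form (global average plus user bias plus item bias plus the inner product of the two latent vectors). The function name and the numeric values are hypothetical.

```python
from typing import Sequence


def predict_preference(
    user_vec: Sequence[float],
    item_vec: Sequence[float],
    user_bias: float,
    item_bias: float,
    global_mean: float,
) -> float:
    """Biased matrix factorization score for one user and one item."""
    dot = sum(p * q for p, q in zip(user_vec, item_vec))
    return global_mean + user_bias + item_bias + dot


score = predict_preference([1.0, 2.0], [0.5, 0.25], user_bias=0.25, item_bias=-0.5, global_mean=3.5)
```

Because the score needs only one inner product per item, it is cheap enough to serve as the fitness estimate for every candidate generated during the evolutionary search.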
The surrogate model
served as the fitness function to guide the personalized evolutionary search in the estimation of distribution algorithm. The dominant population
is taken as the initial population
. The denoising diffusion probability model
is obtained through the user interest preference model based on the denoising diffusion model.
where
and
are, respectively, the weight and bias of the denoising diffusion probability model;
is the neural unit of the hidden layer of that.
The population is fed into the denoising diffusion probability model to obtain the reconstructed representation of individual items. Then, the sampling probability model
in EDA is established by the reconstructed representation of individual items:
where
and
are, respectively, the mean and covariance matrix element of the reconstructed representation of individual items.
According to the elite selection strategy and the fitness function, the evolutionary individuals with higher fitness are selected to generate a subpopulation
, where
is the selection ratio. The mean
is calculated by the maximum likelihood estimation of
.
The mean
mainly controls the center of sampling offspring. An excessively high selection ratio can displace the population mean too far from the optimal fitness region. While this enhances diversity, it slows down convergence in later stages. Conversely, an overly small ratio pulls the mean closer to the optimum, which favors local exploitation but risks premature convergence. Here, an adaptive strategy of
is designed as follows:
where
and
represent the maximum and minimum selection ratios;
indicates the maximum number of fitness function evaluations;
denotes the number of fitness function evaluations up to the current generation.
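The behaviour described here, a large selection ratio early in the search and a small one late, can be illustrated with the sketch below. The linear decay is an assumption made only for illustration and may differ from the adaptive rule used in the paper. The default bounds of 0.5 and 0.1 and all function names are likewise hypothetical.

```python
from typing import Callable, List, Sequence


def selection_ratio(evals_used: int, max_evals: int, ratio_max: float = 0.5, ratio_min: float = 0.1) -> float:
    """Assumed linear decay from ratio_max to ratio_min as the evaluation budget is spent."""
    progress = min(max(evals_used / max_evals, 0.0), 1.0)
    return ratio_max - (ratio_max - ratio_min) * progress


def select_elite(
    population: Sequence[Sequence[float]],
    fitness: Callable[[Sequence[float]], float],
    ratio: float,
) -> List[Sequence[float]]:
    """Keep the best share of the population, and at least one individual."""
    count = max(1, round(ratio * len(population)))
    return sorted(population, key=fitness, reverse=True)[:count]


def mean_vector(individuals: Sequence[Sequence[float]]) -> List[float]:
    """Maximum likelihood estimate of a Gaussian mean: the per-dimension average."""
    size = len(individuals)
    return [sum(ind[d] for ind in individuals) / size for d in range(len(individuals[0]))]


population = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
elite = select_elite(population, fitness=sum, ratio=selection_ratio(0, 100))
center = mean_vector(elite)
```

Shrinking the ratio over time moves the estimated mean toward the fittest individuals, trading early diversity for late convergence.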
According to
, a subpopulation
is formed, where
is the covariance scaling parameter,
. An archive set
is obtained by storing each generation’s subpopulation, where
is the set length of the archive set.
is obtained by combining the current subpopulation. The covariance
is estimated to select more individuals with higher fitness.
While an overly large
facilitates a wide sampling range—beneficial for early-stage diversity—it hinders focused search during later stages. Conversely, an overly small
restricts the sampling range, which is suitable for late-stage local exploitation but risks premature convergence. Therefore, the
value is dynamically adjusted throughout the evolution of the estimation of distribution algorithm. Here, an adaptive strategy of
is calculated as follows:
The proposed adaptive Estimation of Distribution Algorithm dynamically adjusts the selection ratio and covariance scaling parameter by leveraging the historical information from the archive set . This framework maintains strong global exploration capabilities during early evolution and later precisely converges to optimal regions aligning with users’ interest preferences, thereby achieving a dynamic balance in the personalized search process.
New individuals are generated by the sampling probabilistic model in EDA. Based on the Pearson similarity criterion, the similarity between these new individuals and real items in the feasible solution space is computed. The most similar items are then selected to replace the evolutionary individuals, forming a candidate recommendation set . The surrogate model estimates the fitness of each candidate. Finally, through an elite selection strategy, the Top-N items are recommended to the user, completing one interactive recommendation cycle. The population is subsequently updated using the user’s feedback to initiate the next round of the interactive personalized evolutionary search.
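One recommendation cycle of this kind can be sketched as follows. For brevity the sketch samples each dimension independently (a diagonal covariance), which is a simplifying assumption and not the full covariance model described above. The catalog, the surrogate scores, and all function names are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def sample_individual(mean: Sequence[float], std: Sequence[float], scale: float, rng: random.Random) -> List[float]:
    """Draw one individual; `scale` widens or narrows the sampling range."""
    return [rng.gauss(m, scale * s) for m, s in zip(mean, std)]


def nearest_item(individual: Sequence[float], catalog: Dict[str, Sequence[float]]) -> str:
    """Map a generated individual to the most similar real item."""
    return max(catalog, key=lambda item_id: pearson(individual, catalog[item_id]))


def recommend(mean, std, scale, catalog, surrogate: Callable[[str], float], n_samples: int, top_n: int, rng) -> List[str]:
    """Sample, map to real items, score with the surrogate, and keep the Top-N."""
    candidates = {nearest_item(sample_individual(mean, std, scale, rng), catalog) for _ in range(n_samples)}
    return sorted(candidates, key=surrogate, reverse=True)[:top_n]


catalog = {"a": [1.0, 2.0, 3.0], "b": [3.0, 2.0, 1.0], "c": [1.0, 3.0, 1.0]}
scores = {"a": 4.0, "b": 2.5, "c": 3.0}
picks = recommend([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], 1.0, catalog, scores.get, n_samples=20, top_n=2, rng=random.Random(0))
```

Mapping sampled vectors back to real items keeps the search inside the feasible solution space, so every recommended entry is an item that actually exists.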
3.5. Model Dynamic Management Mechanism
To address the diversity and time-varying nature of user interests in complex networks, we design a dynamic model management mechanism. This closed-loop feedback system enables continuous model evolution and strategy optimization. When environmental shifts cause model prediction accuracy to fall below a preset threshold, it triggers a collaborative update of the user preference model and its related models. These models and parameters are then dynamically refined using new user-generated content, thereby promptly tracking user interests and monitoring accuracy. This process guides the personalized search toward satisfactory solutions, ultimately completing the recommendation task.
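The trigger condition can be sketched as below. How accuracy is measured is not specified in this section, so the sketch assumes one simple definition: the share of surrogate predictions that fall within a tolerance of the user's actual feedback. The function name, the tolerance, and the sample values are hypothetical.

```python
from typing import Sequence


def needs_update(
    predicted: Sequence[float],
    observed: Sequence[float],
    threshold: float,
    tolerance: float = 0.5,
) -> bool:
    """Return True when the share of close predictions drops below the threshold."""
    hits = sum(1 for p, o in zip(predicted, observed) if abs(p - o) <= tolerance)
    accuracy = hits / len(predicted)
    return accuracy < threshold


# Two of the four predictions are within the tolerance, so accuracy is 0.5.
retrain = needs_update([4.0, 3.0, 5.0, 1.0], [4.2, 1.0, 4.9, 3.0], threshold=0.75)
```

Checking accuracy only against fresh feedback keeps retraining infrequent while still reacting when user interests drift.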
3.6. Algorithm Implementation and Computational Complexity Analysis
The specific implementation steps of the proposed algorithm are pseudocoded as follows (Algorithm 1):
Algorithm 1: DDM-AEDA
Input: Multi-modal UGCs
Output: Top-N item recommendation list
Start
1. Multi-modal data preprocessing: Multi-modal user-generated contents are processed as described in Section 3.2;
2. Pre-trained models: User-generated contents are collected to train pre-trained doc2vec and ResNet models for representation learning;
3. Initialization: In the search space, an initial dominant group is formed by filtering a set of items aligning with user preferences derived from user-generated data;
Do while (the algorithm termination condition has not been met)
4. User interest preference model: According to the method in Section 3.3, a user interest preference model based on a denoising diffusion model is constructed and trained to extract users’ preference features;
5. Surrogate model: The user preference surrogate model is designed using formula (12);
6. Probability model: A sampling probability model is built using formula (14);
7. Population update: New individuals with user preferences are generated through the probabilistic model to form a set of items to be recommended, and the fitness of individuals is estimated by the surrogate model;
8. Recommendation list: Based on the elite selection strategy, the N items with the highest ratings are selected to generate a recommendation list of items that the user may be interested in;
9. Interactive evaluations: The item recommendation list is submitted to the current user for interactive evaluations. Based on this feedback, the algorithm checks if the termination criteria are met. If so, it concludes and outputs the final result; otherwise, the dominant group is updated with new data for the next iteration.
10. Model dynamic management: The accuracy of the user preference surrogate model is assessed. If its average accuracy falls below the predefined threshold, the process advances to Step 4 to track user preferences. Otherwise, it proceeds to Step 9 to conduct the iterative evolutionary search.
End Do
End
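The control flow of Algorithm 1 can be sketched as a loop over caller-supplied components. Every parameter name below is hypothetical, and the sketch reflects one reading of Steps 4 to 10: the preference model is trained once and retrained only when the surrogate's accuracy drops below the threshold.

```python
def interactive_search(initialize, train_preference, build_surrogate,
                       build_sampler, collect_feedback, is_satisfied,
                       update_population, surrogate_accuracy,
                       top_n, max_rounds, accuracy_threshold):
    """Run the interactive loop with caller-supplied (hypothetical) components."""
    population = initialize()                              # Step 3
    preference = train_preference(population)              # Step 4
    recommendations = []
    for _ in range(max_rounds):
        surrogate = build_surrogate(preference)            # Step 5
        sampler = build_sampler(population)                # Step 6
        candidates = sampler()                             # Step 7
        ranked = sorted(candidates, key=surrogate, reverse=True)
        recommendations = ranked[:top_n]                   # Step 8
        feedback = collect_feedback(recommendations)       # Step 9
        if is_satisfied(feedback):
            break
        population = update_population(population, feedback)
        # Step 10: retrain the preference model only when accuracy degrades.
        if surrogate_accuracy(surrogate, feedback) < accuracy_threshold:
            preference = train_preference(population)
    return recommendations
```

Passing the components as callables keeps the skeleton independent of any particular diffusion model, surrogate, or sampling distribution.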
The computational complexity of the denoising diffusion model-driven adaptive estimation of distribution algorithm, integrating multi-modal data, primarily consists of five components: the vectorized representation of text comments, the vectorized representation of image features, training the user interest preference model, screening the recommended item set, and predicting items’ scores. The vectorized representation of text and images is obtained through offline computing. The computational complexity of training the user interest preference model is . The time cost of filtering the recommended item set is , where is the total number of individual items in the feasible solution space. The time consumption for predicting items’ scores is . Consequently, the overall computational complexity of the proposed algorithm is .
4. Experimental Results and Analysis
4.1. Experimental Environment
To demonstrate the comprehensive performance of the proposed algorithm, we conducted experiments on two public datasets: the Amazon dataset [42] (from Prof. Julian McAuley’s team at the University of California, San Diego) and the Yelp dataset [43]. The statistical details of each dataset are provided in Table 1.
The experiments were run on an Intel Core i5-4590 CPU at 3.30 GHz with 4 GB RAM, and the experimental platform was developed in Python 3.11. Four evaluation indicators were used to assess the personalized search and recommendation algorithms: Root Mean Square Error (RMSE), Hit Ratio (HR), mean Average Precision (mAP), and Normalized Discounted Cumulative Gain (NDCG). RMSE measures the rating prediction ability of personalized search algorithms. HR, mAP, and NDCG measure how well personalized recommendation algorithms predict users’ preferences for items and produce high-quality recommendations, reflecting user satisfaction and usage experience. Together, these indicators capture the prediction accuracy and recommendation performance of personalized search and recommendation algorithms.
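For reference, the four indicators can be computed per user as in the sketch below. It assumes binary relevance and standard textbook definitions; the paper does not spell out its exact formulas, so the normalization choices here (for example, dividing average precision by min(|relevant|, k)) are assumptions. HR@K and mAP@K are obtained by averaging the per-user values over all users.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and true ratings."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def average_precision_at_k(ranked, relevant, k):
    """Average of precision values at the ranks where relevant items occur."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```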
4.2. Comprehensive Performance Comparison Experiments
To verify the effectiveness of the proposed algorithm, comparative experiments were conducted against TruthSR [44], MMSSL [45], and TiM4Rec [46]. These algorithms are briefly introduced as follows:
TruthSR: Captures the consistency and complementarity of user-generated contents to reduce noise interference, and dynamically evaluates prediction credibility by combining subjective and objective perspectives to make personalized recommendations.
MMSSL: Constructs an interaction structure between the user-item collaborative view and the multi-modal semantic view using data augmented with adversarial perturbations. Cross-modal contrastive learning is then used to capture user preferences and semantic commonalities, preserving the diversity of preferences.
TiM4Rec: Introduces a time-aware structured mask matrix that integrates time information into a state space framework, yielding a time-aware Mamba-based recommendation algorithm.
In the experiment, RMSE, HR@10, mAP@10, and NDCG@10 are used to measure the comprehensive performance of these algorithms. Each algorithm was independently run 10 times. The average experimental results are shown in Table 2.
By observing the experimental results, the following conclusions are drawn:
(1) Among the personalized search and recommendation algorithms compared, the proposed DDM-AEDA algorithm demonstrates excellent prediction accuracy and recommendation performance. Specifically, the RMSE values achieved by DDM-AEDA are predominantly better than those of the benchmark algorithms. For instance, on the Amazon-Beauty dataset, DDM-AEDA attained the best RMSE of 1.120, which is 28.53% lower than that of the second-best MMSSL algorithm, 37.40% lower than TruthSR, and 30.65% lower than TiM4Rec. These comparative results underscore the strong predictive capability of the proposed method. They indicate that DDM-AEDA can effectively capture users’ dynamic preferences by leveraging the powerful modeling ability of DDPM combined with multi-step generative modeling. Furthermore, DDM-AEDA mitigates interference from multimodal noise, enabling more comprehensive modeling of complex data distributions. This leads to a more accurate alignment with user preference behaviors, thereby guiding personalized search tasks.
(2) The proposed DDM-AEDA algorithm improves the ranking of items in search results so that it better aligns with users’ interest preferences. It prioritizes items that users are likely to be interested in, placing them at the front of the recommendation list. This enhances the search and browsing experience, leading to a higher hit rate and better average precision. For example, on the Yelp dataset, DDM-AEDA achieved the best HR@10, mAP@10, and NDCG@10 among the compared methods. Specifically, its HR@10 is 23.24% higher than that of TiM4Rec (the second-best method), 59.68% higher than TruthSR, and 50.76% higher than MMSSL. Its mAP@10 is 3.82% higher than TiM4Rec, 29.11% higher than TruthSR, and 18.21% higher than MMSSL. Its NDCG@10 is 13.15% higher than TruthSR, 28.13% higher than MMSSL, and 58.48% higher than TiM4Rec. Overall, DDM-AEDA integrates multi-modal information, such as users’ historical behaviors, text, tags, and images, to more accurately model user preferences and capture the characteristics of multi-modal data. It ranks highly relevant items closer to the top of the list, which strengthens personalized recommendation capabilities and leads to superior recommendation performance and increased user satisfaction.
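The percentages quoted in this comparison are relative differences. The sketch below shows the conventional way such figures are computed, assuming the baseline value is the denominator; the text does not state the formula, so this is an assumption, and the function names are introduced here for illustration.

```python
def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline` (error metrics)."""
    return (baseline - ours) / baseline * 100.0

def relative_gain(ours, baseline):
    """Percentage by which `ours` is higher than `baseline` (ranking metrics)."""
    return (ours - baseline) / baseline * 100.0
```

Under this convention, a statement such as "x% lower RMSE" uses `relative_reduction`, while "x% higher HR@10" uses `relative_gain`. The two are not symmetric, because each is normalized by the baseline rather than by the proposed method's value.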
To further demonstrate the comprehensive performance of the proposed DDM-AEDA algorithm in personalized search and recommendation tasks, DDM-AEDA was compared on various datasets with other IEC algorithms, including multi-layer perceptron-driven IEDA (MLPIEDA), RBM-MSH-assisted IEDA (RIEDA-MsH) [47], and the enhanced interactive estimation of distribution algorithm driven by dual sparse variational autoencoders integrating user-generated content (DSVAE-IEDA) [48]. In the experiments, each algorithm was independently run 10 times, and the evaluation indicators were averaged to measure the comprehensive performance of these algorithms. The average experimental results are shown in Table 3.
By observing the above experimental results, the following conclusions are drawn:
(1) The proposed DDM-AEDA algorithm makes full use of multi-modal user-generated contents to construct both a user preference model and a surrogate model based on user preferences. It generates new evolutionary individuals that reflect user interests and estimates the fitness values of new individual items, thereby guiding the personalized search and recommendation process within an interactive evolutionary computation (IEC) framework. In personalized search experiments on various datasets, DDM-AEDA demonstrated superior prediction accuracy and recommendation performance compared to other algorithms. For example, on the Yelp dataset, DDM-AEDA achieved the best overall evaluation metrics. Specifically, the average RMSE of DDM-AEDA is 3.91% lower than that of the second-best algorithm (DSVAE-IEDA), and 27.74% and 35.75% lower than those of MLPIEDA and RIEDA-MsH, respectively. The average HR@10 of DDM-AEDA is 42.11% higher than that of DSVAE-IEDA, and 115.22% and 68.75% higher than those of MLPIEDA and RIEDA-MsH, respectively. The average mAP@10 of DDM-AEDA is 4.92% higher than that of DSVAE-IEDA, and 36.38% and 23.93% higher than those of MLPIEDA and RIEDA-MsH, respectively. The average NDCG@10 of DDM-AEDA is 8.84% higher than that of DSVAE-IEDA, and 33.17% and 36.87% higher than those of MLPIEDA and RIEDA-MsH, respectively. Although the proposed algorithm did not achieve the best result on every evaluation metric across all datasets, it still delivered strong comprehensive search performance and recommendation quality.
(2) In the comparative experiments on various datasets, DDM-AEDA generally outperforms other algorithms, demonstrating its feasibility, effectiveness, and strong performance in prediction accuracy, search efficiency, and recommendation quality. These results indicate that the proposed algorithm successfully integrates diverse information sources through an interactive adaptive optimization strategy within the IEDA evolutionary optimization framework, enabling accurate modeling of users’ interest preferences. This approach facilitates the generation of high-quality evolutionary individuals, helping to preserve the relative ranking relationship within the dominant group. By leveraging knowledge extracted from high-performing solutions, the proposed algorithm effectively generates new items that align with users’ needs and preferences. These promising individuals are selectively retained in subsequent populations, guiding the personalized evolutionary search process and reducing the risk of convergence to local optima. Simultaneously, a surrogate model based on user preferences predicts item ratings, ensuring that preferred solutions are ranked at the top of the recommendation list. Items that match user interests are selected for rapid recommendation, enabling efficient identification of satisfactory solutions. These strategies enhance the rating prediction capability and overall recommendation performance of the personalized search and recommendation algorithm.
In summary, for personalized search and recommendation tasks, DDM-AEDA fully leverages multi-modal user-generated contents to construct a user interest preference model based on a denoising diffusion model. This approach effectively uncovers deep-seated potential user preferences and captures their evolutionary patterns, thereby improving the fitting accuracy of the user interest model. An adaptive EDA probabilistic model and a user preference-based surrogate model are established within the interactive evolutionary computation framework, enhancing the interactive personalized evolutionary search process and increasing the prediction accuracy of the user evaluation surrogate model. Furthermore, the proposed algorithm guides the direction of personalized evolutionary search using objective assessment indicators—such as user experience and feedback evaluation—enabling users to efficiently locate satisfactory solutions. It achieves an effective balance between recommendation quality and search efficiency. As a result, DDM-AEDA improves the optimization capability, search efficiency, and recommendation effectiveness of personalized search algorithms, while exhibiting strong stability and scalability. It is well-suited to meet the practical demands of personalized search and recommendation in complex multi-modal data environments.
4.3. Ablation Experiments
To evaluate the contribution of its core components, we performed ablation experiments on the proposed DDM-AEDA algorithm, focusing on the multi-modal feature fusion and adaptive distribution estimation strategy modules. Experimental results on the Amazon-Beauty dataset (Figure 3) demonstrate their importance. The model configurations excluding the visual features and the adaptive strategy are referred to as “w/o Visual” and “w/o Adaptive”, respectively.
The ablation study reveals that the removal of the visual feature input module markedly degrades the comprehensive performance of DDM-AEDA. This module is designed to enhance the richness of item representation by processing visual information (color, texture, shape, etc.), thereby enabling the user preference model to account for visual tastes. This function is particularly crucial in highly visual domains such as beauty and clothing. When ablated, the model loses the ability to discriminate between items with similar appearances—for instance, lipsticks of different shades or similar sports shoes—thus significantly reducing the recommendation hit rate and ranking quality. Therefore, the visual feature input is indispensable for achieving high recommendation accuracy and user satisfaction.
Similarly, ablating the adaptive distribution estimation strategy leads to a notable decline in overall performance. In the interactive EDA framework, DDM-AEDA employs this strategy to balance the search process by adapting the selection ratio (sr) and covariance scaling (cs) parameters based on historical information. When this adaptive mechanism is disabled (as in the “w/o Adaptive” experiment, where sr and cs were fixed and the archive update was turned off), the model can no longer adjust its search direction in response to user feedback. This significantly undermines both the diversity of the recommendation results and the convergence speed, confirming that the adaptive strategy is essential for enhancing the search efficiency and recommendation performance of personalized search and recommendation algorithms.
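To make the roles of sr and cs concrete, the following sketch shows one generation of a Gaussian EDA with truncation selection and a scaled covariance, together with a simple adaptation rule. Both the Gaussian model and the multiplicative update rule are assumptions chosen for illustration; they are not the paper's formula (14) or its archive-based mechanism.

```python
import numpy as np

def eda_step(population, fitness, sr, cs, rng, n_offspring):
    """One Gaussian EDA generation: select the top sr fraction, then sample."""
    n_keep = max(2, int(round(sr * len(population))))
    elite = population[np.argsort(-fitness)[:n_keep]]  # truncation selection
    mean = elite.mean(axis=0)
    # cs widens or narrows the sampling distribution; jitter keeps it valid.
    cov = cs * np.cov(elite, rowvar=False) + 1e-8 * np.eye(population.shape[1])
    return rng.multivariate_normal(mean, cov, size=n_offspring)

def adapt_parameters(sr, cs, improved,
                     sr_bounds=(0.1, 0.4), cs_bounds=(0.1, 1.0)):
    """Hypothetical rule: exploit after progress, explore after stagnation."""
    factor = 0.9 if improved else 1.1
    sr, cs = sr * factor, cs * factor
    return float(np.clip(sr, *sr_bounds)), float(np.clip(cs, *cs_bounds))
```

Fixing sr and cs, as in the “w/o Adaptive” configuration, corresponds to never calling `adapt_parameters`, so the sampling distribution cannot widen when the search stagnates or narrow when it is making progress.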
4.4. Hyperparameter Sensitivity Analysis
The performance and efficiency of the proposed DDM-AEDA algorithm are highly sensitive to its hyperparameters. To assess this sensitivity, we conduct experiments focusing on two key parameters: the selection ratio (sr) and the covariance scaling (cs) parameters.
In our experiments, sr was varied over several values, and cs was swept over a separate range for each of sr = 0.1, 0.2, 0.3, and 0.4. We employed HR and NDCG as evaluation metrics. The results of this sensitivity analysis on the Amazon-Beauty dataset are presented in Figure 4.
The experimental results reveal that, for a fixed sr, both HR and NDCG first increase and then decrease as cs grows, forming a smooth curve. This trend indicates that a moderate cs value yields the best recommendation performance. A balance is therefore critical: a large cs promotes diversity in the early stage of the search but hinders convergence to optimal solutions later, whereas an overly small cs aids later-stage focus but risks premature convergence to local optima. Therefore, we configure the DDM-AEDA algorithm with sr = 0.2 and cs = 0.4.
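A sensitivity sweep of this form can be organized as in the sketch below. The `evaluate` callable and the grid layout are hypothetical: `evaluate(sr, cs)` stands for a full run of the algorithm that returns a single score such as HR or NDCG, and `cs_grid` maps each sr value to the cs values tried for it.

```python
def sweep(evaluate, sr_values, cs_grid):
    """Evaluate every (sr, cs) pair and return all scores plus the best pair."""
    results = {}
    for sr in sr_values:
        for cs in cs_grid[sr]:  # each sr has its own list of cs values
            results[(sr, cs)] = evaluate(sr, cs)
    best = max(results, key=results.get)
    return results, best
```

Keeping the full `results` dictionary, rather than only the best pair, is what allows the rise-then-fall trend in cs to be plotted for each fixed sr.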
5. Conclusions
This paper addresses personalized search and recommendation in complex network environments by proposing a denoising diffusion model-driven adaptive estimation of distribution algorithm integrating multi-modal data. The approach combines user interest preference modeling with surrogate-assisted IEC. Specifically, a user interest preference model is constructed using a denoising diffusion model, incorporating multi-modal user-generated contents. Within the estimation of distribution algorithm framework, a user preference-based surrogate model is established, alongside adaptive operators and strategies designed to guide the personalized evolutionary search. A dynamic model management mechanism is also introduced to track shifts in user interest preferences. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art algorithms by approximately 5–23% in HR@10, 4–5% in mAP@10, and 5–14% in NDCG@10. The algorithm is validated to offer advantages in functional completeness, prediction accuracy, and decision transparency, thereby enhancing both the personalized search experience for users and the overall performance of the recommendation system. Future work will focus on improving computational efficiency, ensuring system security, and strengthening user privacy protection to deliver intelligent, exclusive, and secure personalized services.