We employ the TensorFlow and Keras frameworks to conduct a series of experiments. This section describes the dataset adopted in the experiments, the baseline comparison methods, the relevant parameter settings, the evaluation metrics, and a comparison of the experimental results.
4.1. Datasets
We select data from the Amazon platform, which provides not only a large amount of numerical data but also users’ textual reviews of the corresponding products (http://jmcauley.ucsd.edu/data/amazon/, accessed on 30 January 2021). During the data sampling phase, we employed a stratified random sampling strategy based on the distribution of ratings. As the original dataset is excessively large, for the sake of experimental efficiency, we extracted 50,000 samples from each of the “Electronic” and “Book” Amazon datasets, totaling 100,000 samples. This strategy ensures that our smaller subset aligns with the original dataset’s user rating tendencies. In the data preprocessing phase, we focused on standardization, as well as cleaning punctuation and stop words. Standardization involved converting all text to lowercase and removing HTML tags. The cleaning process eliminated all punctuation and used the standard English stop word list provided by the NLTK library to filter out high-frequency but low-information words.
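A minimal sketch of this preprocessing pipeline is shown below; the function name and the sample input are illustrative, while the stop word list is the standard NLTK English list mentioned above.

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # standard English stop word list
STOP_WORDS = set(stopwords.words("english"))
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_review(text: str) -> str:
    """Standardize and clean one review: lowercase, strip HTML tags,
    remove punctuation, and filter NLTK English stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = text.translate(PUNCT_TABLE)     # remove punctuation
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess_review("<br/>Great phone, REALLY loved the battery life!"))
# -> "great phone really loved battery life"
```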
Table 1 describes the variables used in these two datasets, and
Table 2 presents the detailed statistical information of the data.
The fields contained in these two datasets are “reviewer ID” (ID of the reviewer), “asin” (ID of the product), “reviewer Name” (name of the reviewer), “helpful” (helpfulness rating of the review), “review Text” (text of the review), and “overall” (rating of the product). We supply the “reviewer ID”, “asin”, “overall”, and “review Text” fields to the proposed model. Among them, “reviewer ID”, “asin”, and “overall” are numerical data, with “overall” representing the user’s rating of the corresponding product, while “review Text” is textual data, representing the user’s specific evaluation of the product. In each round of experimentation, we randomly select 80% of the data as the training set, 10% as the validation set, and the remaining 10% as the test set. After using the training data for model training, the trained model is applied to the test set to evaluate model performance.
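The split can be reproduced, for example, with scikit-learn; in the sketch below, the column names follow the raw Amazon JSON keys (“reviewerID”, “asin”, “overall”, “reviewText”) and the file name is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; keys follow the raw Amazon review dumps.
df = pd.read_json("reviews_Electronics.json", lines=True)
df = df[["reviewerID", "asin", "overall", "reviewText"]]

# 80% train / 10% validation / 10% test, re-drawn randomly each round.
train_df, holdout_df = train_test_split(df, test_size=0.2)
val_df, test_df = train_test_split(holdout_df, test_size=0.5)
```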
4.2. Baseline Methods
To validate the effectiveness of the proposed method in this paper, we selected a series of representative baseline methods for experimental comparison. This set of baselines covers multiple dimensions: it includes classic models like Factorization Machines (FM) and Deep Neural Networks (DNN) for feature interaction and nonlinear learning, as well as advanced models like Deep Factorization Machine (DeepFM) and MLP_NLP that represent current trends in efficiently integrating heterogeneous features. Additionally, to examine our model’s performance from more perspectives, we incorporated the Time Effect Collaborative Filtering (TECF) model, which specifically addresses the temporal dynamics of user preferences, and the Stacked Autoencoder as a representative of unsupervised representation learning. These methods were chosen to cover a range from classic recommendation algorithms to cutting-edge deep learning models, aiming to comprehensively evaluate our model’s performance.
FM: The factorization machine model constructs feature combinations by taking the inner product of the latent vectors of each feature dimension. The core idea of FM is to learn a low-dimensional latent vector for each feature and use the inner products of these vectors to efficiently model second-order interactions between all feature pairs. This effectively addresses the feature combination explosion problem encountered by traditional polynomial models when handling large-scale sparse data. In this study, we chose FM as a baseline model for the following reason: our proposed model aims to capture higher-order nonlinear feature interactions through deep networks, while FM focuses on explicit second-order interactions. Therefore, directly comparing with FM can clearly demonstrate the gains our model achieves in learning complex higher-order relationships [14].
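For reference, the standard second-order FM prediction described above can be written as

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j,$$

where $w_0$ is the global bias, $w_i$ are the first-order weights, and $\mathbf{v}_i \in \mathbb{R}^k$ is the low-dimensional latent vector of feature $i$, so that the inner product $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ models the second-order interaction between features $i$ and $j$.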
DNN: DNN forms the foundation of deep learning technology, enabling the automatic learning and extraction of complex nonlinear relationships from data through a network structure with multiple hidden layers. In this study, we use DNN as a baseline model for the following reason: since our proposed model is itself a deep learning architecture, a basic DNN model serves as the most direct and fair comparison. By comparing performance with the DNN, we can clearly assess the real performance improvements brought by the specific modules we designed, such as the text processing and feature fusion mechanisms, thereby strongly demonstrating the innovative value of our model architecture [15].
DeepFM: DeepFM is an advanced recommendation model that combines the advantages of Factorization Machines (FM) and Deep Neural Networks (DNN) through a clever end-to-end parallel architecture. In this model, the FM component efficiently learns low-order (especially second-order) explicit feature interactions, while the DNN component delves into complex and implicit nonlinear relationships hidden in the data. Both components share the same feature embedding layer, greatly enhancing the model’s learning efficiency and expressive power. In this study, DeepFM is chosen as a baseline model primarily due to its successful integration of different levels of feature interactions, which directly aligns with our research goal of integrating heterogeneous features (text and numerical data). By comparing performance with DeepFM, we can clearly evaluate whether our proposed integration strategy offers superiority over this established interaction fusion paradigm [16].
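In this parallel architecture, the final prediction combines both components over the shared embeddings, i.e., $\hat{y} = \operatorname{sigmoid}(y_{FM} + y_{DNN})$, where $y_{FM}$ and $y_{DNN}$ are the outputs of the FM and DNN components, respectively.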
TECF: The TECF model is designed to capture the dynamic evolution of user interests. It builds upon traditional collaborative filtering by introducing a time decay factor, assigning different weights to users’ historical interactions. Behaviors closer to the current time are considered more reflective of users’ true interests and are thus given higher weights. In this study, TECF is chosen as a baseline model to examine the effectiveness of our method from a completely different perspective. By comparing it with TECF, we can assess whether the user preference representation derived from a deep understanding of content can outperform a model specifically designed to handle temporal dynamics. This comparison helps to comprehensively validate our model’s ability to capture users’ core, stable preferences and demonstrates that the textual information we introduce provides fundamental value beyond simple temporal patterns [17].
Stacked Autoencoder: The Stacked Autoencoder is a classic unsupervised deep learning model, formed by stacking multiple autoencoders layer by layer. Each layer learns a compressed representation of the output from the previous layer and attempts to reconstruct its input. In this study, the Stacked Autoencoder is chosen as a baseline model primarily to evaluate whether our model, which utilizes label information for end-to-end learning, can learn feature representations that are superior to the high-quality features extracted solely from the data’s inherent structure [18].
MLP_NLP: MLP_NLP is a deep learning architecture that combines natural language processing techniques with a multilayer perceptron to process and utilize text information. The core idea is to use NLP methods to convert unstructured data, such as user text reviews, into numerical vectors, which are then fed into a deep neural network for training. In this study, MLP_NLP is chosen as a baseline model primarily because it represents a classic paradigm of applying deep learning to natural language processing tasks. One of the core innovations of our proposed model lies in the way it processes text data and integrates it with numerical features. By comparing it with MLP_NLP, we can evaluate the performance improvements brought by our proposed method [19].
4.3. Parameter Details
In the proposed model, the dimension of the embedding layer is 32, i.e., the DNN layer has 32 neurons; the activation function is “ReLU”; the optimizer is “Adam”; the loss function is the cross-entropy function; the number of training epochs is 10; and the number of training samples per batch is 1000. Additionally, early stopping is employed during training to prevent overfitting. The parameters used to convert the comment text summary into a vector are set as follows: the dimension of the feature vector is 100; the window size is 10 (the window size indicates the maximum distance between the current word and the predicted word in a sentence); “min_count” is 5 (the minimum word count is used to truncate the dictionary, and words with a frequency lower than this value are discarded); the value of alpha is 0.01 (alpha represents the initial learning rate); the value of the sampling threshold is 1 × 10⁻⁵; “workers” (the number of parallel workers used to control training) is set to 1; and the training period is set to 50.
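These text-vectorization settings map directly onto the Doc2Vec interface of Gensim (the library listed in the implementation details below). A minimal sketch, assuming `summaries` holds the preprocessed summary token lists:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical input: `summaries` is a list of preprocessed summary token lists.
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(summaries)]

model = Doc2Vec(
    vector_size=100,  # dimension of the feature vector
    window=10,        # maximum distance between the current and predicted word
    min_count=5,      # words with a frequency below 5 are discarded
    alpha=0.01,       # initial learning rate
    sample=1e-5,      # sampling threshold for high-frequency words
    workers=1,        # number of parallel workers used for training
    epochs=50,        # training period
)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

summary_vector = model.infer_vector(summaries[0])  # 100-dimensional text embedding
```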
To facilitate the comparison of the experimental results, all the baseline models (DeepFM, DNN, FM, Stacked Autoencoder, and MLP_NLP) use the “Adam” optimizer during training, the loss function is the cross-entropy function, the number of training epochs is 10, and the number of training samples per batch is 1000. The TECF model adopts the item-based collaborative filtering algorithm. The Stacked Autoencoder uses two stacked coding layers with 32 and 16 neurons, respectively, and the “ReLU” activation function. The number of neurons in the decoding layer is set to 6; that is, since user ratings are expressed as integers from 0 to 5, the decoding layer matches the feature vector dimension obtained by “one-hot” coding, and the activation function of this layer is the sigmoid function. In the DeepFM model, the number of hidden factors in the quadratic part of the FM is set to 100, and the DNN has two hidden layers with 32 and 16 neurons, respectively, using the “ReLU” activation function; the output layer of the DNN has 6 neurons with the “sigmoid” activation function. Finally, combining the DNN and the FM, the final output layer dimension is 6, and the activation function is “sigmoid”. The DNN model has an input layer, two hidden layers, and an output layer. The input layer receives numerical data representing the users and the corresponding items, and the hidden layers map these data to multi-dimensional vectors. The dimensions of the two hidden layers are 32 and 16, respectively, with the “ReLU” activation function; the output layer has 6 neurons with the “sigmoid” activation function. The number of hidden factors in the FM model is set to 100. The MLP_NLP model uses an embedding layer to convert text data into a 32-dimensional vector, two hidden layers with 32 and 16 neurons, respectively, and the “ReLU” activation function; the output layer has 6 neurons with the “sigmoid” activation function.
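As an illustration of these settings, the following is a minimal Keras sketch of the DNN baseline. Whether the user and item IDs pass through an embedding layer or feed directly into the dense layers is not fully specified above, so the embedding step here is an assumption.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn_baseline(num_users: int, num_items: int) -> keras.Model:
    # Numerical inputs: user and item indices.
    user_in = keras.Input(shape=(1,), name="user_id")
    item_in = keras.Input(shape=(1,), name="item_id")
    # Assumed: map IDs to dense vectors before the hidden layers.
    u = layers.Flatten()(layers.Embedding(num_users, 32)(user_in))
    v = layers.Flatten()(layers.Embedding(num_items, 32)(item_in))
    x = layers.Concatenate()([u, v])
    x = layers.Dense(32, activation="relu")(x)      # first hidden layer
    x = layers.Dense(16, activation="relu")(x)      # second hidden layer
    out = layers.Dense(6, activation="sigmoid")(x)  # one-hot ratings 0-5
    model = keras.Model([user_in, item_in], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

# Training settings from above:
# model.fit([user_ids, item_ids], onehot_ratings, epochs=10, batch_size=1000)
```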
All experiments were conducted on a system equipped with an Intel Core i9 CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The models were implemented in Python 3.6, with key libraries including NumPy 1.21.2, Pandas 1.4.3, Scikit-learn 1.1.1, and Gensim 4.2.0 for Doc2Vec text vectorization. For each model, we conducted systematic grid searches, primarily tuning the learning rate, batch size, hidden layer dimensions, and dropout rate. Each model was evaluated by averaging at least five independent runs to ensure result stability.
4.5. Performance Comparison
Our experiments were conducted on Amazon’s “Electronic” and “Book” datasets. During the textual embedding process, an abstract of the comment text is extracted, and the comment text is converted into an embedding vector according to this abstract. As the abstract is a summary of the user comment text, it captures the central idea of the comment. Therefore, embedding the text abstract into the model can not only improve the performance of the model but also avoid the effect of redundant and complicated text descriptions on the model. The proposed method is a new DNN based on multi-form information embedding; for ease of description, we abbreviate it as the MENN model.
Table 3 shows the performance of the various models on the “Electronic” and “Book” datasets. The smaller the evaluation metrics MAE, MSE, and RMSE, the lower the prediction error of the model, indicating superior model performance. The best results for each evaluation metric are highlighted in bold. Clearly, the MAE, MSE, and RMSE of FM and TECF are larger than those of the other models, suggesting that FM and TECF mine online user preferences less effectively. This may be because these two models are not based on a DNN, and the DNN has powerful feature representation capabilities. The DeepFM model has the next poorest evaluation metrics. Compared to TECF and FM, DeepFM partially incorporates a DNN, which improves its performance to a certain extent; however, given the poor performance of the FM, the FM component of DeepFM may still lower the overall model performance. The other DNN-based models, namely the Stacked Autoencoder, DNN, MLP_NLP, and the MENN model, have relatively better evaluation metrics, and the MENN model performs the best.
Figure 4, Figure 5 and Figure 6 show the improvement rate of the MENN model compared to the other models on the two datasets “Electronic” and “Book”.
In Figure 4, Figure 5 and Figure 6, the abscissa represents the comparison between MENN and each of the other baseline models, and the ordinate represents the improvement ratio of the MENN model over that baseline. For Figure 4, which measures the MAE, the ordinate can be expressed as

$$\text{Improvement rate} = \frac{\mathrm{MAE}_{\text{baseline}} - \mathrm{MAE}_{\text{MENN}}}{\mathrm{MAE}_{\text{baseline}}} \times 100\%.$$
The ordinates of Figure 5 and Figure 6 are defined analogously to that of Figure 4, except that the ordinate of Figure 5 measures the MSE and the ordinate of Figure 6 measures the RMSE. From Figure 4, Figure 5 and Figure 6, it is clear that the performance improvement of the MENN model reaches roughly 90% at the maximum and 10% at the minimum. Overall, the performance improvement of MENN over the TECF, DeepFM, and FM models is the largest, which is consistent with the earlier analysis. Second, MENN performs better than the Stacked Autoencoder and MLP_NLP; although both belong to the DNN category in terms of model construction, differences in their internal operating mechanisms lead to different results. From the evaluation metrics, MENN shows the smallest improvement over DNN, followed by the MLP_NLP model. The reason for this result is that although MENN, DNN, and MLP_NLP all belong to the DNN category in terms of model architecture, the DNN in this experiment is analyzed mainly from the perspective of numerical data embedding, while MLP_NLP uses NLP technology and a multi-layer perceptron to analyze from the perspective of text-based data embedding. Moreover, when processing the text data, MLP_NLP does not consider the possible adverse effects of high-frequency redundant parts of the text corpus on model performance. The MENN model, by contrast, considers the roles of both numerical data embedding and text data embedding, effectively combining the advantages of these two embedding methods in model feature learning. Therefore, compared with DNN and MLP_NLP, MENN achieves better performance.
The performance improvement of the MENN method is primarily attributed to the following factors: compared to traditional methods like FM and TECF, MENN leverages a deep neural network architecture to capture more complex nonlinear feature interactions, explaining the significant performance gap with these baseline models. While the improvement over DNN and MLP_NLP is smaller, this reflects the incremental advantage of multimodal fusion—MENN processes numerical and textual features simultaneously through parallel embedding layers, preserving complementary information that single-modality methods might miss, especially in capturing subtle user preferences. To validate MENN’s performance compared to the best-performing baseline, we conducted paired t-tests on results from five independent experiments. The test showed a p-value much less than 0.05 (p ≈ 0.001), indicating that the observed performance gains are statistically significant.
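A sketch of this significance test with SciPy is shown below; the MAE values are placeholders standing in for the five paired runs, not the paper’s actual results.

```python
import numpy as np
from scipy import stats

# Placeholder MAE values from five paired runs (hypothetical numbers).
mae_menn = np.array([0.412, 0.418, 0.409, 0.415, 0.411])
mae_best_baseline = np.array([0.453, 0.449, 0.458, 0.451, 0.455])

t_stat, p_value = stats.ttest_rel(mae_menn, mae_best_baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 indicates significance
```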
To explore the performance of the models under various dataset sizes, we merged the two datasets “Electronic” and “Book” into an overall dataset and randomly extracted 20,000, 40,000, 60,000, 80,000, and 100,000 records from it. The performance of each model is analyzed for these five dataset sizes, as shown in Figure 7, Figure 8 and Figure 9.
As can be seen from Figure 7 to Figure 9, the metrics of MENN, DNN, and MLP_NLP change little as the dataset size increases, and these models maintain smaller prediction errors and better performance. The evaluation metrics of the Autoencoder fluctuate slightly with dataset size: as the dataset grows, its prediction error first increases and then begins to decrease. Similarly, the evaluation metric values of the TECF and FM models, which have the least satisfactory performance, also vary with increasing dataset size. The fluctuation range of the prediction error of the DeepFM model is larger than that of the other models. It is plausible that the fluctuation of DeepFM is affected not only by its DNN component but also by its FM component; given this dual impact, the fluctuation range of DeepFM is slightly higher than that of the other models. In addition, looking at the trends in Figure 7, Figure 8 and Figure 9, the error values of the DNN-based Autoencoder, MLP_NLP, DNN, and MENN models are relatively close, and their evaluation metric values are much smaller than those of TECF, FM, and DeepFM. This occurs because TECF and FM are not based on a DNN, while DeepFM only partially incorporates a DNN architecture, which improves its performance over TECF and FM but still leaves it behind the fully DNN-based models.
Figure 10 and Figure 11 are visualizations of the embedded text comment data on the two datasets “Electronic” and “Book”, respectively. These three-dimensional visualizations were generated using the t-SNE dimensionality reduction method. After summarizing the text comments into concise abstracts, the abstract text is converted into feature vectors that can be embedded in the model. In practice, these feature vectors are multi-dimensional; to facilitate visual display, the text embedding vectors are reduced to 3-D vectors. Each dot in the figures represents a 3-D vector corresponding to one user’s comment on a product, i.e., one example of a user comment text embedding. As shown in Figure 10 and Figure 11, many users’ comment text embeddings are clustered together and very similar, which means that the comment texts they represent have similar meanings. In general, the user preferences reflected by users’ textual comments are consistent with the trend of the users’ ratings of the items: users with a rating of 5 have a different comment tendency than those with a rating of 1. This suggests that text embedding vectors with different comment tendencies are separated by spatial distance, whereas users with similar preferences tend to comment on products in similar ways, so the text embedding vectors they generate are relatively close in space and thus clustered. When a user’s rating for an item is less than 3, the user’s preference for the item is low; in Figure 10 and Figure 11, blue dots indicate the embeddings of these users’ text comments, represented as “class 1” in the legend. Conversely, when a user’s rating for an item is greater than 3, the user prefers the corresponding item; red dots indicate the embeddings of these users’ text comments, represented as “class 2” in the legend. Clearly, most text embedding vectors with similar degrees of preference are clustered, indicating that the tendencies of the related text comments are relatively close. The embedding vectors formed from text comments can therefore help DNN models better mine user preferences.
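The visualization procedure can be sketched as follows, assuming `doc_vectors` holds the 100-dimensional text embeddings and `ratings` the corresponding “overall” values:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assumed inputs: doc_vectors is an (n, 100) array of summary embeddings,
# ratings is the corresponding (n,) array of overall ratings.
coords = TSNE(n_components=3, random_state=42).fit_transform(doc_vectors)

low, high = ratings < 3, ratings > 3  # class 1: low preference; class 2: high preference

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(*coords[low].T, c="blue", s=5, label="class 1")
ax.scatter(*coords[high].T, c="red", s=5, label="class 2")
ax.legend()
plt.show()
```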