A Dual-Attention Autoencoder Network for Efficient Recommendation System

: Accelerated development of mobile networks and applications leads to the exponential expansion of resources, which causes problems such as trek and overload of information. One of the practical approaches to ease these problems is recommendation systems (RSs) that can provide individualized service. Video recommendation is one of the most critical recommendation services. However, achieving satisfactory recommendation service on the sparse data is difficult for video recommendation service. Moreover, the cold start problem further exacerbates the research challenge. Recent state-of-the-art works attempted to solve this problem by utilizing the user and item information from some other perspective. However, the significance of user and item information changes under different applications. This paper proposes an autoencoder model to improve recommendation efficiency by utilizing attribute information and implementing the proposed algorithm for video recommendation. In the proposed model, we first extract the user features and the video features by combining the user attribute and the video category information simultaneously. Then, we integrate the attention mechanism into the extracted features to generate the vital features. Finally, we incorporate the user and item potential factor to generate the probability matrix and defines the user-item rating matrix using the factorized probability matrix. Experimental results on two shared datasets demonstrates that the proposed model can effectively ameliorate video recommendation quality compared with the state-of-the-art


Introduction
Recommendation systems (RSs) [1,2], aiming at predicting the scores that users might give the items according to historical data, have been applied in many fields, such as social networking, news, information retrieval, courses [3], movies [4], music [5], and knowledge services [6]. However, with the rapid development of information science, more and more applications are continuously generating large-scale data, leading to an information explosion [7]. All these generated data can entail considerable convenience to our daily lives. Meanwhile, this phenomenon also leads to some problems, such as trek and overload of information. Moreover, how to extract valuable information from largescale data quickly and efficiently becomes a significant challenge [8]. RSs [9] prove to be effective tools to extract useful information from the chaotic application data, and have attracted extensive attention from academics and the industry. As a result, a set of stateof-the-art research works have been proposed, and many practical applications have been developed. Collaborative filtering is one of the most widely used recommendation algorithms. The algorithm generates user groups close to the current user by analyzing their historical behavior data and then determines recommendation content on the basis of the preferences of the current user group. Presently, RS has been successfully applied in many areas, including social networking, news, information retrieval, and e-commerce, and so on. With the rapid development of mobile networks and mobile services, the user and item data are split into different blocks and stored in different devices. However, most current studies focus on web-based applications with the data stored in the same database. How to design the recommendation algorithm for the split data scenario is a new challenge.
The shallow learning-based approach cannot catch the deep features of the user and items given that only one hidden level exists and relying on manual design features, traditional recommendation algorithms have limited manageability against the background of big data. With the commercialization of big data, cloud computing, and other technologies, more and more data are being recorded and stored on the Internet. These data include multiple sources of heterogeneous information, such as video [10], audio, text, graphics, and images [11]. They encapsulate rich user and item background information and individualized demands [12][13][14]. How to effectively alleviate the day sparseness and cold start problems by using the multi-source heterogeneous auxiliary information in the RS has currently become the hottest topic in academic research [15,16]. As a vital branch of artificial intelligence, deep learning technology has outstanding performance in various fields, such as image processing, NLP, and speech recognition, which introduces new commercialization directions for RS [17]. Deep learning can automatically extract features from the massive data of users and items, and obtain deep-level feature representations of users and items. Moreover, deep learning can also integrate multi-source heterogeneous data through feature fusion to obtain a decentralized feature representation [18]. When it is combined with traditional recommendation algorithms, recommendation quality can be ameliorated significantly. Google proposed the Wide & Deep model [19], which combined various linear models and deep neural network models to take the advantages of memory and generalization to make recommendations. Google's model has played a leading role. Recommendation algorithms based on autoencoders have also been extensively studied in the recommendation field. For example, AutoRec [20] is a classic application of early deep learning techniques in the recommendation field, which combines autoencoders and collaborative filtering to make recommendations. ACAE [21] is the latest application that uses an autoencoder fused with an attention mechanism to display recommendation applications, but it does not use auxiliary information. SASRec [22] is a sequence model based on self-attention. It considers the time correlation of the system to try to identify related items from the user's operation history so as to predict the next item. It has accomplished impressive results. In [23], Tal et al. proposed two models-DARIA and SARAH-which use the user's personalized historical information and item attribute information to assist recommendation through a dual attention mechanism. Good experimental results were obtained.
In this research, the author puts forward a multi-head attention autoencoder matrix factorization (MAAMF) model and an autoencoder model multi-head attention autoencoder (MAA) fused with a multi-head self-attention mechanism to mine users and items auxiliary information, and it is combined with probability matrix factorization to realize rating prediction. Autoencoders with a dual-multi-head self-attention mechanism effectively extract the hidden factors of user attribute information and video category information and observe the vital information on the user side and the video side, thus improving the interpretability of the recommendation model based on deep learning. It incorporates probability matrix decomposition to achieve rating prediction, effectively alleviating data sparseness and cold start problems. The main contributions of this paper are summarized as follows: (1) It utilizes the user's attribute information and the category information of the video to reveal the potential factors of the user and the video succeedingly and combine the probability matrix factorization to realize the rating prediction.
(2) It explores the role of the self-attention mechanism and multi-head self-attention mechanism in mining hidden factors of side information. (3) We have carried out numerous experiments on two public datasets. We extracted features by only using user attribute information, through item category information, through the self-attention mechanism, and through multi-head self-attention mechanism using user and item side information simultaneously to illustrate the effectiveness of the proposed MAAMF model.
The remainder of this article is organized as follows: Section 2 reviews the related works about the RS. Section 3 introduces the proposed MAAMF model in detail. Section 4 presents and analyzes our experimental results. Finally, Section 5 concludes.

Matrix Factorization
Matrix factorization [24] is the most frequently used recommendation algorithm. The famous Netflix Prize in the RS uses matrix factorization technology. Matrix factorization has attracted considerable attention due to its high recommendation accuracy, strong model scalability, and easy integration of side information. Common matrix factorizations include SVD [25] and NMF [26]. These methods achieve rating prediction by decomposing a high-dimensional sparse rating matrix into multiple low-dimensional matrices and multiplying multiple low-dimensional matrices to approximate the original scoring matrix. EigenRec [27] realizes top-N recommendations by constructing a low-dimensional representation of similarity between items and combining it with the classic PureSVD algorithm. In [28][29][30], Liu et al. discussed the process from traditional matrix factorization to modern RS using neural networks for multi-objective learning [31]. The abovementioned recommendation methods are based only on the user's attribute information or the category information of the videos. This paper aims to utilize the user attribute cues and videos category cue to predict the rating accurately, hence the self-attention and multi-head self-attention mechanisms are introduced to mine the hidden factors of side information from users and videos.

Collaborative Filtering
With the advancement of information technology, side material of users or videos, for instance, user or item attribute information, user portrait information, social network information, comment information, and so on, are recorded and stored. The features of users and videos or user preferences are hidden in these big data. How to ascertain useful information from these big data to ameliorate the accuracy of recommendation and reduce the problems of cold start and data sparsity is a current research hotspot. Numerous studies are applying text comment information to assist recommendation. For instance, Con-vMF [32], DeepCoNN [33], D-Attn [34], ANR [35], CARL [36],DAML [37], and other models combine all comments of users or items into a long document and then build a model to assist recommendation. Meanwhile, NARRE [38], TARMF [39], and MPCN [40] build a model on the basis of every single piece of comment of the user or item to obtain the feature of each comment, which is then combined to form the features of the user or item to make recommendations. Moreover, many studies use social network information to assist recommendation, such as [41][42][43][44], or using social network and text information to assist recommendation, such as [14,45]. How to extract hidden information from large-scale data is the main challenge faced by RS. In [46], Anastasiu et al. put forward a good research plan on how to find the nearest neighbors from big data and how to expand the potential factors. In [47], Sardianos et al. and in [48], David et al. surveyed various aspects of largescale social network-based RS, summarizing the challenges and attractive problems, and discussed specific solutions. Current research shows that the advantages of multi-task learning and ensemble learning are very obvious, and they are forceful means to solve multi-source information fusion problem [49,50].
However, in video recommendation, our intuition tells us that people of different genders, age groups, occupations, and regions will have distinct preferences. Similarly, many video websites practice classifying video information by category. One example the fact that children like to watch cartoons is judged by categories. Everyone prefers one or more categories of video. In [51], Zhao et al. applied the recommendation model on the basis of deep learning to the field of industrial video recommendation. He proposed a large-scale multi-objective ranking system, and applied it to the next video recommendation. The experiment achieved good results. In [52], Chen et al. assisted memory-based collaborative filtering recommendation by fusing video category information, which ameliorates the accuracy, coverage, precision, and recall rate of prediction, but it failed to consider the impact of user side information on video recommendation. Utilizing the side information of users and videos simultaneously for ensemble learning is a vital method to ameliorate the accuracy of video recommendation. Recently, the knowledge graph [53,54] is also introduced as the side information for the recommendation algorithms, which has achieved significant success in the field of natural language processing. Although the existing studies have made some progress in RS, no work has considered the user's attribute information and the category information of the video in an online platform to predict rating.

Proposed MAAMF Model
This section first introduces the structure of the proposed deep hybrid rating prediction model MAAMF, which integrates the dual-multi-head self-attention mechanism. By using autoencoder with the multi-head self-attention mechanism, the potential factors of the user side and the project side are determined. The factors are then integrated into the probability matrix decomposition to achieve rating prediction. The principle of the MAAMF model is explained in detail subsequently. Finally, the MAAMF model is optimized through maximum a posteriori estimation. Table 1 summarizes all the symbols used in this article to clarify our method.  Figure 1 shows the structure of the deep hybrid rating prediction model MAAMF fused with the dual-multi-head self-attention mechanism. In the figure, N is the number of users and M is the number of videos, R is the user video rating matrix, and R is the rating of the i-th user on the j-th video. U is the side information of the i-th user, such as the user's rating information and the user's attribute information. V shows the side information of the j-th item, such as the rating information of the video and the category information of the video. α is the compressed representation of U , and β is the compressed representation of V . λu is the user regularization parameter, λv is the project regularization parameter, U and V are the new approximate representations obtained by decompressing U and V through the autoencoder of the fusion multi-head attention mechanism, respectively. The purpose of MAAMF is to integrate MAA into the PMF framework and achieve rating predictions through PMF and two MAAs. This study aims to identify the potential matrix of users and videos ( ∈ × , ∈ × ) to rebuild the rating matrix R, and the observed rating matrix is calculated as follows:

MAAMF Model Structure
where ( | , ) is the probability density function of the Gaussian normal distribution with mean and variance . demonstrates the indicator function. If user i evaluates the video j, then it is set to 1, otherwise, it is set to 0.

MAAMF Model Principle
The core of our MAAMF model is the MAA module. The purpose is to use two MAA modules to obtain the potential factors of users and videos from user side information and video side information, respectively. The MAA module is shown in Figure 2. In the figure, S represents side information, specifically user attribute information or video category information in this model. X represents user rating or item rating information. S' represents decompressed auxiliary information, and X' represents the scoring information after decompression. h represents the hidden layer, and L represents the number of hidden layers. h1 represents the first hidden layer, h(L/2) represents the L/2-th hidden layer. W represents the weight parameter, and W1 represents the first layer weight parameter, W(L/2) represents the weight parameter of the L/2-th layer. The input layer uses the preprocessed tensor that is the side information as the input of the neural network. In this study, the user side information is the concatenation of the user rating matrix and the user characteristic information, and the video side information is the concatenation of the video rating matrix and the video classification information.
The coding layer contains four layers, which are compressed layer by layer from 400 to 200 to 100 to 50. After each layer passes through the fully connected layer, batch normalization is performed, and then activated by the relu activation function, and finally, dropout is performed to prevent overfitting.
The multi-head attention layer selects vital features. The multi-head attention mechanism [55] first conducts linear transformation, and then the outcome is input to the scaled dot-Product attention, which is done h times in total. One head for each time, so it becomes multi-head. means the time, means to query, means key, means the values.
, , and respectively represent the parameter matrix of , , and in the mith transformation, is the weight initialization parameter, as shown in Equations (2) and (3). The parameters are not shared among the heads, and then the h times of scaled dot-Product attention results are concatenated, and the value is calculated by performing linear transformation, which is used as the result of multi-head attention: The multi-head attention mechanism is developed on the basis of self-attention. The calculation of self-attention involves the connection between one input and all other inputs, and the calculation process mainly involves matrix multiplication. The self-attention mechanism is to perform three linear transformations on the result matrix of the embedding layer to obtain three new matrices named query matrix , key matrix , and value matrix . The query matrix is used as the score corresponding to each key matrix . The score and the key matrix are multiplied to calculate the corresponding weight, which is then multiplied by the value matrix . After performing a weighted average on the value matrix, we can finally obtain the scaled dot-Product attention. In practice, multiple parallel self-attention mechanisms are often used to form a multiheaded attention mechanism, that is, multiple attention matrices and multiple weights are used to weight the input values, and finally the weighted average results are concatenated. The self-attention mechanism can only capture the correlation between adjacent information. If multiple self-attention mechanisms are used, it can capture the correlation between relatively distant information. The final concatenation combines multiple attention mechanisms that can better describe the relationship between information at different distances.
The decoding layer corresponds to the coding layer, decompressing layer by layer from 50 to 100 to 200 to 400. Likewise, after each layer passes through the fully connected layer, batch normalization is performed, and then activated by the relu activation function, and finally, dropout is performed to prevent overfitting.
The probability density function of the zero-mean spherical Gaussian distribution above the user potential model with a difference of is shown in Equation (4): Different from the traditional PMF model, we assume that the user latent vector is affected by the MAA internal weight , represents the ancillary information of the i-th user, and is the influence of Gaussian noise, and we obtain Equation (5) as follows: Using them to optimize the rating of the user's potential model, hence we obtain Equation (6): where is its variance, and k represents the dimension of the latent factor. The user potential model based on conditional distribution is written as, where A represents the side information of the user, the latent vector ( , ) obtained by the MAA module is used as the mean of the Gaussian distribution, the variance I is the variance of the Gaussian distribution, I is the identity matrix, and n is the user number.
Similarly, we can obtain the probability density function of the internal weight W of the item as shown in Equation (8). In addition, the probability density function of the item based on conditional distribution is shown in Equation (9), where is its variance, and k represents the dimension of the potential factor, C represents the ancillary information of the video, the latent vector , obtained by the MAA module is used as the mean of the Gaussian distribution, the variance is the variance of the Gaussian distribution, I is the identity matrix, and m is the item number: ( | , C, ) = | , ,

Model Optimization
The variables ， , , and are optimized by using maximum a posterior estimation. In Equation (10) A represents user side information, C represents project side information, and R represents predicted rating. U represents the user potential factor, V represents the project potential factor, and i and j represent the variables of the user and the project. n is the number of users, m is the number of videos, W is the internal weight of each model, and is the variance of the corresponding Gaussian distribution: Substituting equations (1), (6), (7) (8), (9) into Equation (11), and then making both sides negative we obtain Equation (12): We utilized the coordinate descent method to minimize ℒ, and iteratively optimize the latent variables while fixing remaining parameters. Equation (12) is the quadratic function of U. Assuming that V and W are constants, the loss function ℒ can be obtained by differentiation. Take the same operation on V to obtain the following expressions: where ， ， , and are diagonal matrices, and and are balance parameters. Equation (13) shows the effect of updating by the user potential vector ( , ) through , and Equation (14) shows the effect of updating by the project potential vector , through . However, we cannot update and similar to U and V, because and are linked to the nonlinearity in the MAA architecture. When U and V are temporarily constant, the loss function ℒ can be referred to as the regular term weighted square error function of . We obtain ℰ( ) and ℰ( ) as given in equations (15) and (16).
By optimizing U, V, , and W, we can finally obtain the predicted user's rating Equation (17): where R is the expected value of rating.

Experimental Results and Discussion
In this section, two shared datasets from the video field are used to verify the proposed MAAMF model. First, the datasets and evaluation indicators are described. Then the environment and basic parameter settings of the experiment are introduced, and then the principle of the baseline model is briefly described. Finally, the best result was obtained through experimental adjustment, and the final result was studied and compared to the baseline model.

Datasets and Evaluation Indicators
To accurately evaluate the performance of the MAAMF model, experiments were conducted on the MovieLens shared dataset. The MovieLens dataset [56] is one of the most widely used open-source datasets in the field of video recommendation. Users' ratings of films range from 1 to 5. This experiment uses two subsets ML-100k and ML-1m, as shown in Table 2. The table presents that ML-100k is rating data in order of 0.1 million, and ML-1m is rating data in order of 1 million. Prediction accuracy measures the ability of a recommendation model to predict user behavior, and is the most vital offline evaluation indicator for recommendation algorithms. This research aims to ameliorate the accuracy of rating prediction. Root mean square error (RMSE) and mean absolute error (MAE) are used to evaluate the performance of the MAAMF model. For user U and video V in the dataset, is the actual rating data of user for video , and ˆ is the predicted rating data given by the recommendation model, is the number of ratings in the test set. The definitions are as (18) and (19), respectively:

Experimental Software and Hardware Environment and Parameter Settings
This experiment was carried out on the integrated development platform PyCharm Community Edition 2020.1.1 x64 of Windows 10, using Python 3.7 language development, and the Keras 2.3.1 deep learning library. The equipment environment is an Intel Core i7 4790K 4.00 GHz CPU, equipped with a GeForce GTX 1080 GPU and 32 GB of RAM.
The project side has 18 categories, and an empty category is designed and added to form 19 categories. If a video does not have a category, the 19th column is 1, otherwise, it is marked 1 in the corresponding column. If a video belongs to more than 1 category, then 1 is marked on each of these corresponding columns to obtain the category representation matrix of the video. The auxiliary information on the user side uses the user's ID, gender, age, occupation, and postcode, which are transformed into a vector through one-hot encoding. Training data recorded for 80%, cross-validation data recorded for 10%, and test data recorded for 10%. The dimension of the hidden factor is 50.

Baseline Models
Six well-known baseline models, including PMF, SVD, aSDAE, R-ConvMF, PHD, and DUPIA are compared to the MAAMF model proposed in this study. The are briefly described as follows: PMF [24]: Probability matrix decomposition is a widely known collaborative filtering model. It explains the rating prediction task from the perspective of probability. It assumes that the rating obeys a Gaussian distribution.
SVD [25]: Singular value decomposition model is a prominent matrix factorization model, which uses rating information to achieve rating prediction.
aSDAE [57]: aSDAE is a two-tower model, which extracts the hidden factors of the auxiliary information on the user side and the project side through two SDAEs. The item side information uses the word bag model as the input of the deep learning model. R-ConvMF [58]: Convolutional matrix factorization is a text context-aware hybrid recommendation prototype. It uses item statistics that capture contextual information, integrates CNN into the PMF framework, and considers Gaussian noise in different ways.
PHD [59]: PHD is a new hybrid model that uses an autoencoder to extract user-side hidden factors. Meanwhile, it uses a convolutional neural network to extract item-side text hidden factors, which is then integrated into the PMF framework to achieve rating prediction.
DUPIA [60]: The DUPIA model is based on the PHD model and adds a self-attention to the CNN on the item side.

Analysis of Experimental Results
In this part, we first explored the influence of different parameters on the experimental results, and then compared the MAAMF model we proposed to the baseline models.

Influence of Balance Parameters on Experimental Results
Tables 3 and 4 test the changes of RMSE and MAE on the two shared datasets under different combinations of λu and λv, and the best results are shown in bold. The tables exhibit that when λu = 8, λv = 50 on ML-100k and λu = 3, λv = 120 on ML-1m, the MAAMF model achieves the best results. It indicates that the appropriate combination of λu and λv plays a vital role in balancing user-and project-side information. Adjusting the λu and λv hyperparameters has a vital influence on RMSE and MAE. Dropout is one of the most effective and commonly used regularization methods in neural networks. It reduces the risk of overfitting by randomly discarding some layer output features. Figure 3a,b present that when the dropout is 0.2, the ML-100k achieves the optimal MAE value of 0.6985, and when the dropout is 0.4, the ML-1m achueves the optimal MAE value of 0.6574. To reduce overfitting, dropout regularization significantly improved the model's generalization ability.

Changes of RMSE Value with the Number of Iterations
In the experiment, we assume that when the RMSE value of the test set increases more than five times during the iteration process, the training is terminated. Figures 4 and  5 show that the RMSE values of the cross-validation and the test set are very close. The RMSE value of the training set increased once before decreasing continuously. The RMSE value on the cross-validation set and the test set initially decreased and then slowly increased.   Table 5 shows the RMSE results of the proposed model and five well-known RS models on two shared datasets, where MAAMF-U in the table indicates that only user attribute information is used to assist recommendation, and MAAMF-I indicates that only item category information is used to assists recommendation. MAAMF-UI-S represents a user and item information-assisted recommendation through the fusion of self-attention mechanism, and MAAMF-UI-M represents user and item information assisted recommendation through the fusion of multi-head self-attention mechanism. The results of SVD and aSDAE in the table reveal that if there is little hidden information in the auxiliary information, or the training effect of deep learning is not good, the classic traditional recommendation models may outperform the recommendation models based on deep learning. The comparison of the experimental results of MAAMF-UI-S and MAAMF-UI-M with the results of other models shows that the vital information in the auxiliary information can be paid attention to by fusing the attention mechanism. Table 5. RMSE result of different models in the two shared datasets. Results of the state-of-the-art models are underlined, and the optimal result of our model is marked in bold. Compared with the most advanced model DUPIA, the RMSE value of MAAMF-UI-S on the ML-100k dataset has improved by 3.46%, whereas the performance of DUPIA on the same dataset has improved by 1.70% compared with PHD, which is lower than the improvement of MAAMF-UI-S. The RMSE value of MAAMF-UI-M on the ML-1m dataset has improved by 0.47% when compared with DUPIA, which has an improvement of 0.31% on the same dataset when compared with PHD, lower than MAAMF-UI-M's improvement. The comparison of experimental results of MAAMF-U and MAAMF-I shows that different auxiliary information affects the experimental results distinctly. In this study, the recommendation assisted with item category information achieves a better result than the recommendation assisted with user attributes. Compared with MAAMF-U, the RMSE value of MAAMF-I on the ML-100k dataset has improved by 1.24%. Among the well-known recommendation algorithms, the PHD and DUPIA model that uses user and project auxiliary information simultaneously has the same design thought as MAAMF-UI-S and MAAMF-UI-M. It has an outstanding performance on ML-1m with a large amount of data, which shows that integrating multi-source information of users and projects simultaneously and expanding the number of samples in the training dataset are the future research directions. Given that the RMSE value of MAAMF-UI-S on ML-100k is lower than that of MAAMF-UI-M may result from the relatively small scale of the ML-100k dataset, and the deep learning algorithm shows instability on small-scale datasets. When PHD and MAAMF are compared, the optimal deep learning based model and the optimal neural attention mechanism based model respectively, the RMSE values of the latter are respectively 5.10% and 0.78% better than the former.

Models
In addition, we performed a one-tailed paired t-test test on the results of the DUPIA model and the MAAMF-UI-M model under different training data percentages in the ML-1m dataset. We found that there is a significant difference between the results of the DUPIA model and the MAAMF-UI-M model, with the p-value of the t-test reported in Table 6. The experimental results prove that our model outperforms the state-of-the-art model. The overall results of the experiment show that user attribute information and video category information impact the score prediction results. Mining the potential factors on the user side and the potential factors on the item side simultaneously can more effectively assist the recommendation. In addition, importance of the information in the auxiliary information varies, and more attention on more important information will improve recommendation outcome. This provides support for future research on constructing a unified multi-source information fusion framework.

Comparison of Experimental Results Under Different Percentages
Finally, the effect of different training data percentages on the RMSE results is also studied. The results are shown in Figure 6. The figure reveals that the proposed MAAMF model obtains the best result except for 20% training data. In the figure, except for the PMF and SVD models, the other models are applications of deep learning in the RS. The results in the figure show that recommendation models combined with deep learning achieve better results than the traditional recommendation models, but the auxiliary information source and the algorithm used to extract potential factors significantly impact the experimental results. The figure also presents that for the same model, as the percentage of training data increases, the result of RMSE gradually decreases, and the accuracy of rating prediction gradually increases. The RMSE curves of the aSDAE and R-ConvMF models cross, and the RMSE curves of the PHD and MAAMF-UI-M models cross, indicating that the robustness of the recommendation model based on deep learning must be ameliorated. In this experiment, we discussed the influence of the regularization parameters λu, λv, and dropout on the experimental results, and described the changes in the RMSE values of the training set, cross-validation set, and test set under different iteration times. A comparative study between the model proposed in this study and the five well-known baseline models reveals the feasibility of the proposed MAAMF model. The performance of different models is compared through experimental profits under different training data percentages. PMF and SVD are classic collaborative filtering recommendation algorithms. aSDAE, R-ConvMF, and PHD are recommendation algorithms based on deep learning. DUPIA and MAAMF are recommendation algorithms based on neural attention mechanism.
The overall results of the experiments and the comparison results under different percentages reveal that these models are constantly developing and improving, from solely using scoring data to using scoring data and auxiliary information, from using auxiliary information to paying more attention to important information in auxiliary information. The experiment unveils that the use of side details can ameliorate the accuracy of rating prediction, and different side information significantly impacts the prediction results. Making full and effective use of multi-source heterogeneous data can better assist recommendations. In addition, the success of MAAMF demonstrates the importance of the user's attribute information and the category information of the video in the RS, which was ignored by previous research.

Conclusions
In this study, we proposed an autoencoder model fused with a multi-head self-attention mechanism to mine user and item auxiliary information combined with probability matrix decomposition to achieve rating prediction. The autoencoder model fused with the dual multi-head self-attention mechanism effectively extracts the hidden factors of the user's basic attribute information and the item category information and pays attention to the vital information on the user side and the item side. It ameliorates the efficiency of the recommendation model on the basis of deep learning. The autoencoder model is combined with probability matrix decomposition to achieve rating prediction. Experimental results on two shared datasets of ML-100k and ML-1m demonstrate the usefulness of the proposed MAAMF model. Exploring more hidden factors in auxiliary information is a vital way to ameliorate the quality of recommendation. Furthermore, the time dimension of auxiliary information is also a vital factor affecting recommendation. For example, most people prefer to pay attention to the latest movies. How to better utilize time details is the focus of future research. We have also noticed that some auxiliary information will also influence one another. Ultimately, future studies must examine that how to effectively integrate the features of each auxiliary information in the RS. Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Data Availability Statement: Data of our study is available upon request.

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding the publication of this paper.