Sparse Data Recommendation by Fusing Continuous Imputation Denoising Autoencoder and Neural Matrix Factorization

In recent years, although deep neural networks have yielded immense success in solving various recognition and classification problems, the exploration of deep neural networks in recommender systems has received relatively less attention. Meanwhile, the inherent sparsity of data is still a challenging problem for deep neural networks. In this paper, firstly, we propose a new CIDAE (Continuous Imputation Denoising Autoencoder) model based on the Denoising Autoencoder to alleviate the problem of data sparsity. CIDAE performs regular continuous imputation on the missing parts of the original data and trains the imputed data as the desired output. Then, we optimize the existing advanced NeuMF (Neural Matrix Factorization) model, which combines matrix factorization and a multi-layer perceptron. By optimizing the training process of NeuMF, we improve the accuracy and robustness of NeuMF. Finally, this paper fuses CIDAE and optimized NeuMF with reference to the idea of ensemble learning. We name the fused model the I-NMF (Imputation-Neural Matrix Factorization) model. I-NMF can not only alleviate the problem of data sparsity, but also fully exploit the ability of deep neural networks to learn potential features. Our experimental results prove that I-NMF performs better than the state-of-the-art methods for the public MovieLens datasets.


Introduction
In the era of information explosion, big data exhibits a rich value and great potential, which brings transformative development to human society, but it also generates the serious "information overload" problem. Recommender systems are an effective way to alleviate the problem of "information overload", having been widely adopted by many online services, including E-commerce, online news, and social media sites [1]. Recommender systems can help determine which information to offer to individual consumers and allow online users to quickly find the personalized information that fits their needs [2]. Collaborative Filtering (CF) is a successful approach commonly used by many recommender systems [3]. CF [4,5] uses the user's past interaction records (such as ratings) to model the user's preferences for items. The scarcity of original interaction records, known as the data sparsity problem, has always been a major difficulty for CF.
As the most popular approach among various CF techniques, Matrix Factorization (MF) [6,7], which projects users and items into a shared latent space, using a vector of latent features to represent a user or an item, has become a standard model for recommendation due to its scalability, simplicity, and flexibility [8,9]. Most previous research [3,7,[10][11][12][13][14][15] on MF did not change the nature of linearity, such as integrating it with neighbor-based models [7], combining it with topic models of item content [3], and extending it to factorization machines [15] for a generic modelling of features. Despite the effectiveness of MF for CF, such a linear model is insufficient to understand the complex and non-linear relationship between users and items [16]. Furthermore, the problem of data sparsity is difficult to solve in MF.
Of course, there has been some advanced research [1,2,17,18] on MF that essentially changed the linear structure of MF by combining the idea of deep learning. The combination of deep learning and MF mainly uses the implicit feedback data to learn the hidden vector of the user(item) by using the deep learning method, thereby predicting the user's preference for the item based on the hidden vector. Among existing models, an outstanding performance was shown for the NeuMF (Neural Matrix Factorization) model proposed in [1]. The NeuMF model combines matrix factorization and a multi-layer perceptron to achieve a good performance.
MF based on deep learning can be regarded as a kind of nonlinear generalization of the traditional hidden factor model. Its obvious advantage is the introduction of nonlinear feature transformations in the process of learning the hidden vector of the user(item), which has a better performance than traditional MF. However, this method still cannot improve the problem of data sparsity in MF.
In this paper, in order to alleviate the problem of data sparsity, we propose a new CIDAE (Continuous Imputation Denoising Autoencoder) model based on the Denoising Autoencoder (DAE). The CIDAE model performs regular continuous imputation for the missing parts of the original data and trains the imputed data as the desired output, which is different from the DAE algorithm that randomly adds noise to the input data. The advantage of CIDAE is that it alleviates the problem of data sparsity by using the idea of imputation.
Then, this paper optimizes the advanced NeuMF model based on the training process. The NeuMF model introduces the idea of deep neural networks into matrix factorization. The NeuMF model first proposes the GMF (Generalized Matrix Factorization) model and then combines the GMF model with the MLP (Multi-Layer Perceptron) model. We name the optimized NeuMF model the O-NeuMF-p model. Additionally, our experimental results prove that the accuracy of O-NeuMF-p is higher than that of NeuMF.
Finally, we fuse CIDAE and O-NeuMF-p with reference to the idea of ensemble learning. We name the fused model the I-NMF (Imputation-Neural Matrix Factorization) model, which combines the advantages of CIDAE and O-NeuMF-p. The I-NMF model can not only alleviate the problem of data sparsity, but also fully exploit the ability of deep neural networks to learn potential features, and I-NMF is also robust. Furthermore, our experimental results confirm that the performance of I-NMF is better than CIDAE, O-NeuMF-p, and current advanced recommendation algorithms for the public MovieLens (ML) [19] datasets.
In summary, our main contributions of this work are outlined as follows:
• We propose a new CIDAE model based on DAE. The advantage of CIDAE is that it alleviates the problem of data sparsity by using the idea of imputation.
• By optimizing the training process of the NeuMF model, we improve the accuracy and robustness of NeuMF. We name the optimized NeuMF model the O-NeuMF-p model.
• We fuse CIDAE and O-NeuMF-p with reference to the idea of ensemble learning. Additionally, we name the fused model the I-NMF model. The I-NMF model can not only alleviate the problem of data sparsity, but also fully exploit the ability of deep neural networks to learn potential features, and I-NMF is also robust.
This paper is organized as follows. In Section 2, we present the architecture and details of the CIDAE model. In Section 3, we briefly introduce the NeuMF model and explain in detail how we optimize its training process. In Section 4, we propose the idea of fusing CIDAE and O-NeuMF-p, and present the architecture and details of the I-NMF model. In Section 5, we evaluate all the models mentioned above using the public MovieLens datasets, and compare the results with current advanced recommendation algorithms. Concluding remarks with a discussion of some future work are in the final section.

CIDAE: Continuous Imputation Denoising Autoencoder
Our CIDAE consists of three parts: learning from the original data; continuous imputation for the missing parts of the original data; and learning from the imputed data. Figure 1 shows the structure of CIDAE.

Part 1: Learning from the Original Data
In the first part, there are many existing models [10][11][12][13][14] to choose from for learning the original data. In this paper, we chose the traditional autoencoder (AE) [20].
For each user u i , some items in the original data are of interest to u i , while the remaining items are the missing parts of the original data. The goal is to predict the items that are of interest to u i from the missing parts of the original data.
In this part, the labels of the items that u i is interested in are set to 1, and the labels of the remaining items are set to 0. Similar to [16,21], the labels of all items constitute the initial value of the user vector u i (u ij = 0 or 1).
According to the AE, we use u_i as the input and obtain a reconstruction of u_i by encoding and decoding. The specific process is as follows:
Encoding: Map u_i ∈ R^n to the d-dimensional hidden layer by the encoding function f:

y = f(u_i) = σ(W u_i + b)

where W ∈ R^{d×n}, b ∈ R^d, y ∈ R^d, n is the total number of items, and σ is a non-linear mapping function, e.g., a sigmoid function.
Decoding: Map y ∈ R^d back to the n-dimensional space by the decoding function g for reconstructing u_i. Let ũ_i denote the reconstructed vector of u_i:

ũ_i = g(y) = σ(W' y + b')

where W' ∈ R^{n×d}, b' ∈ R^n, ũ_i ∈ R^n. We hope that u_i and ũ_i are as similar as possible, and the minimum objective function of AE is as follows:

min_{W, W', b, b'} Σ_{i=1}^{m} L(u_i, ũ_i) + λ(||W||^2 + ||W'||^2)

where m is the total number of users and L is the loss function; we choose the cross entropy as the loss function. λ is a regularization parameter to prevent overfitting. Just like the traditional autoencoder, we optimize the objective function by using the gradient descent method.
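As a concrete illustration, the encode/decode pass and the cross-entropy reconstruction loss described above can be sketched in pure Python. The tiny dimensions and random weights below are hypothetical; a real implementation would use a deep-learning framework and train by gradient descent.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x, b):
    # multiply weight matrix W (list of rows) by vector x and add bias b
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def ae_forward(u, W, b, W2, b2):
    # Encoding: y = sigmoid(W u + b), y in R^d
    y = [sigmoid(v) for v in matvec(W, u, b)]
    # Decoding: reconstruction = sigmoid(W2 y + b2), back in R^n
    return [sigmoid(v) for v in matvec(W2, y, b2)]

def cross_entropy(u, u_rec, eps=1e-9):
    # reconstruction loss L(u, u_rec); the targets are 0/1 labels
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(u, u_rec))

# toy example: n = 4 items, d = 2 hidden units
rng = random.Random(0)
n, d = 4, 2
W  = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(d)]
b  = [0.0] * d
W2 = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n)]
b2 = [0.0] * n
u = [1, 0, 1, 0]
u_rec = ae_forward(u, W, b, W2, b2)
loss = cross_entropy(u, u_rec)
```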

Part 2: Continuous Imputation on the Missing Parts of the Original Data
In this part, CIDAE performs continuous imputation for the missing parts of the original data by using ũ_i, the vector reconstructed in Part 1. Let û_i ∈ R^n denote the imputed vector of u_i. Each û_ij in û_i is calculated as follows:

û_ij = u_ij if u_ij = 1, and û_ij = min(f(ũ_ij), 1) if u_ij = 0

where f(x) can be any continuous function, linear or nonlinear; for example,

f(x) = kx

where k is a hyper-parameter. Obviously, the larger the value of k, the larger the proportion of data that is imputed to 1. There are many choices for f(x), which shows that CIDAE has great flexibility.
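The imputation step can be sketched as follows. The rule of keeping observed entries at 1 and capping imputed values at 1 is our assumption, consistent with the remark that a larger k imputes a larger proportion of the data to 1.

```python
def impute(u, u_rec, f=lambda x: 1.4 * x):
    # keep observed interactions (label 1); impute each missing entry from
    # the reconstructed value, capped at 1 (capping rule is our assumption)
    return [1.0 if uj == 1 else min(f(rj), 1.0)
            for uj, rj in zip(u, u_rec)]

# observed item stays 1; 0.5 -> 1.4*0.5 = 0.7; 0.8 -> capped at 1
u_hat = impute([1, 0, 0], [0.9, 0.5, 0.8])
```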

Part 3: Learning from the Imputed Data
The traditional Denoising Autoencoder (DAE) first adds random noise to the input data and then reconstructs the original data from the corrupted input. CIDAE instead performs regular imputation on the target (output) data, which is different from DAE.
The specific process is similar to Part 1, as we still choose u_i ∈ R^n as the input and obtain a reconstruction by encoding and decoding. In Part 3, let ū_i ∈ R^n denote the reconstructed vector of u_i. We want the value of ū_i to be as similar as possible to û_i, not u_i, which is different from Part 1. The minimum objective function of CIDAE is as follows:

min Σ_{i=1}^{m} L(û_i, ū_i) + λ(||W||^2 + ||W'||^2)

Similar to [22], we introduce a hyper-parameter α in order to distinguish between the imputed data and the original data. We can distinguish the reconstruction error between the imputed data and the original data by using α as a weight. So, the final minimum objective function of CIDAE is as follows:

min Σ_{i=1}^{m} ( Σ_{j: u_ij = 1} L(û_ij, ū_ij) + α Σ_{j: u_ij = 0} L(û_ij, ū_ij) ) + λ(||W||^2 + ||W'||^2)

In this way, CIDAE can adjust the loss weight of the imputed data through α. Just like the traditional denoising autoencoder, we optimize the objective function of CIDAE by using the gradient descent method.
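A sketch of the α-weighted reconstruction loss, under our reading that the cross-entropy on imputed entries (original u_ij = 0) is weighted by α while observed entries keep full weight; the paper's exact form may differ.

```python
import math

def cidae_loss(u, u_hat, u_bar, alpha, eps=1e-9):
    # per-item cross-entropy between imputed target u_hat and
    # reconstruction u_bar; imputed entries are weighted by alpha
    total = 0.0
    for uj, t, p in zip(u, u_hat, u_bar):
        ce = -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        total += ce if uj == 1 else alpha * ce
    return total

full = cidae_loss([1, 0], [1.0, 0.7], [0.8, 0.6], alpha=1.0)
no_imputed = cidae_loss([1, 0], [1.0, 0.7], [0.8, 0.6], alpha=0.0)
```

With alpha = 0 only the observed entry contributes, so the loss drops, mirroring how α tunes the weight of the imputed data.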
It is worth emphasizing that CIDAE trains the imputed dataû i as the desired output. On the one hand, the imputation method can alleviate the sparseness of the original data. On the other hand, a well-founded imputation is more meaningful than random noise. In addition, the CIDAE model, like the DAE algorithm, can increase the robustness of the model by introducing disturbances.

Optimization of the NeuMF Model
In this section, we briefly introduce the NeuMF model [1], and then explain in detail how to optimize NeuMF from the training process. The NeuMF model consists of GMF (Generalized Matrix Factorization) and MLP (Multi-Layer Perceptron). Figure 2 shows the structure of NeuMF.

GMF (Generalized Matrix Factorization)
GMF has four layers: input layer, embedding layer, inner layer, and prediction layer (output layer). The input to GMF is a sparse user ID and item ID. The embedding layer is a fully connected layer that projects sparse vectors into dense vectors. The vector obtained by the embedding layer can be regarded as the hidden vector of the user (item). The hidden vector of user u is denoted as p_u, and the hidden vector of item v is denoted as q_v. The mapping formula of the inner layer is as follows:

φ_GMF = p_u ⊙ q_v

where ⊙ represents the element-wise product of vectors. The operation of the last prediction layer (output layer) is

ŷ_uv = a_out(h^T (p_u ⊙ q_v))

where a_out is the activation function. According to the literature [1], we adopt the sigmoid function σ(x) as a_out. h is the weight of the prediction layer and is trained by the loss function.
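A minimal sketch of the GMF prediction: the element-wise product of the two hidden vectors, followed by a sigmoid output unit with weight h. The example vectors are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gmf_predict(p_u, q_v, h):
    # phi = p_u ⊙ q_v (element-wise product), then y = sigmoid(h . phi)
    phi = [p * q for p, q in zip(p_u, q_v)]
    return sigmoid(sum(hk * xk for hk, xk in zip(h, phi)))

score = gmf_predict([0.5, 1.0], [0.2, 0.4], [1.0, 1.0])
```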

MLP (Multi-Layer Perceptron)
MLP is designed to explore the interaction between the users' and items' latent features. Different from the collaborative filtering of GMF, MLP is more flexible: it is not limited to the vector inner product, and can deeply learn the potential interaction between the user's hidden vector p_u and the item's hidden vector q_v. The specific process of MLP is as follows:

z_1 = [p_u; q_v]
z_x = a_x(W_x^T z_{x-1} + b_x), x = 2, ..., L
ŷ_uv = σ(h^T z_L)

where W_x is the weight matrix of each layer, b_x is the bias vector of each layer, a_x is the activation function of each layer, and [·;·] denotes vector concatenation. According to the literature [1,23], we adopt the Rectifier (ReLU) as a_x. ŷ_uv is the output of the last prediction layer, and σ (sigmoid function) is the activation function of the prediction layer. h is the weight of the prediction layer and is trained by the loss function. MLP selects the four-layer (L = 4) tower structure, and halves the layer size for each successive higher layer. Together with the final prediction layer, MLP has a total of five layers.
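A sketch of the MLP tower: concatenate the two hidden vectors, apply ReLU layers, and finish with a sigmoid output. The toy weights below are hypothetical and much smaller than the paper's four-layer tower.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_predict(p_u, q_v, layers, h):
    # z0 = [p_u ; q_v] (concatenation), then a ReLU tower, sigmoid output
    z = list(p_u) + list(q_v)
    for W, b in layers:
        z = [max(0.0, sum(w * x for w, x in zip(row, z)) + bi)
             for row, bi in zip(W, b)]
    return sigmoid(sum(hk * zk for hk, zk in zip(h, z)))

# toy tower: input size 4 -> 2 -> 1, halving the layer size each step
layers = [
    ([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
score = mlp_predict([1.0, 0.0], [0.0, 1.0], layers, [1.0])
```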

The NeuMF Model
NeuMF is a comprehensive model that combines GMF with MLP. NeuMF combines the training of GMF and MLP by adding the "NeuMF layer". The specific combination process is as follows:

ŷ_uv = σ(h^T [φ_GMF; φ_MLP])

where φ_GMF is the output of the inner layer of GMF, φ_MLP is the output of MLP after four layers, and [·;·] denotes vector concatenation. The "NeuMF layer" can be understood as a combination of GMF and MLP from the prediction layer. σ (sigmoid function) is the activation function of the "NeuMF layer". h is the weight of the "NeuMF layer" and is trained by the loss function.
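The NeuMF-layer combination can be sketched as the concatenation of the two towers' outputs followed by a single sigmoid unit; the example values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neumf_predict(phi_gmf, phi_mlp, h):
    # concatenate the two towers' last hidden vectors, then apply the
    # "NeuMF layer": one sigmoid output unit with weight vector h
    phi = list(phi_gmf) + list(phi_mlp)
    return sigmoid(sum(hk * x for hk, x in zip(h, phi)))

score = neumf_predict([0.2], [0.8], [1.0, 1.0])
```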

The NeuMF-p Model
NeuMF-p(pre-training) refers to the model that initializes NeuMF. NeuMF is initialized by pre-training GMF and MLP. The experiment in paper [1] shows that NeuMF-p is better than NeuMF. The specific initialization process of NeuMF is described in the following.
Firstly, GMF and MLP are separately pre-trained. Pre-training here means that the GMF (MLP) model is first randomly initialized and then trained by the loss function until it converges. Secondly, the parameters of the pre-trained GMF and the pre-trained MLP become the initialization parameters of NeuMF-p. Finally, h is the weight of the "NeuMF layer", and its initialized value is as follows:

h ← [γ h_GMF; (1 − γ) h_MLP]

where h_GMF is the weight of the prediction layer of GMF and h_MLP is the weight of the prediction layer of MLP. γ is a hyper-parameter, and its value is set to 0.5 [1].
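The weight initialization of the NeuMF layer, concatenating the two pre-trained prediction-layer weights scaled by γ and 1 − γ, can be sketched as:

```python
def init_neumf_weight(h_gmf, h_mlp, gamma=0.5):
    # h <- [gamma * h_GMF ; (1 - gamma) * h_MLP]
    return [gamma * w for w in h_gmf] + [(1 - gamma) * w for w in h_mlp]

h = init_neumf_weight([1.0, 2.0], [4.0])
```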
Considering that the experiment in [1] shows that NeuMF-p is better than NeuMF, this paper chooses to optimize the training process of NeuMF-p directly. Here is a brief introduction to the loss function of NeuMF-p and the original training process. The loss function of NeuMF-p is as follows:

L = − Σ_{(u,v) ∈ y ∪ y−} [ y_uv log ŷ_uv + (1 − y_uv) log(1 − ŷ_uv) ]

This is also known as the log loss or binary cross-entropy loss, and it can be optimized by the SGD (Stochastic Gradient Descent) algorithm. The value of the target y_uv is 0 or 1, which can be regarded as a label. If y_uv = 1, there is an interaction between user u and item v; if y_uv = 0, there is no interaction between user u and item v. By using the sigmoid function as the activation function of the output layer, ŷ_uv is constrained to the range (0, 1). Here, y represents the set of observed interactions in dataset Y, and y− represents the set of counterexamples in dataset Y. A counterexample in the original paper [1] refers to an unobserved interaction in dataset Y.

The O-NeuMF-p Model
We name the optimized NeuMF-p model the O-NeuMF-p model. This paper optimizes the training process of NeuMF-p from the perspective of the counterexample label. The original paper [1] records the label y_uv in the counterexample set y− as 0, which we think is unreasonable. Recommender systems focus on the data that has no interaction between users and items, which means that counterexamples are the focus of the research. If the label of a counterexample is fixed at 0 during training, it is not conducive to learning the interactions in y−. Referring to the idea of DAE (adding noise to the original data and reconstructing it), we change the label in the counterexample set y− from 0 to a random number. The specific change formula is as follows:

y_uv = random(0, r), for (u, v) ∈ y−

where random(0, r) draws a uniform random number from [0, r], and the hyper-parameter r can be an arbitrary number in [0, 1]. If r = 0, then y_uv = random(0, 0) = 0, which recovers the original situation where the label y_uv in y− is marked as 0. If r = 1, then y_uv = random(0, 1), so the label y_uv in y− is a random number from 0 to 1. Adding random numbers makes the label y_uv in y− no longer equal to 0, which is equivalent to adding a disturbance to the data of y−. On the one hand, O-NeuMF-p is therefore more robust than NeuMF-p; on the other hand, O-NeuMF-p focuses more on learning from the y− data, which is exactly what we want. In addition, our experimental results show that, on the same dataset, O-NeuMF-p achieves a higher recommendation accuracy and quality than NeuMF-p, which further confirms the advantage of introducing random numbers in O-NeuMF-p.
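A sketch of the counterexample relabeling, assuming random(0, r) denotes a uniform draw from [0, r]; the fixed seed is used here only for reproducibility.

```python
import random

def relabel_negatives(labels, r, rng=None):
    # keep positive labels; replace each counterexample label 0 with a
    # uniform random number in [0, r]; r = 0 recovers the original labels
    rng = rng or random.Random(0)
    return [y if y == 1 else rng.uniform(0.0, r) for y in labels]

noisy = relabel_negatives([1, 0, 0, 1], r=0.7)
```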

I-NMF: Fusion of CIDAE and O-NeuMF-p
On the one hand, although O-NeuMF-p performs better than NeuMF-p, it still does not solve the problem of data sparsity. On the other hand, although CIDAE can alleviate the problem of data sparsity, CIDAE only utilizes the idea of a single-layer neural network (AE) and does not fully exploit the ability of deep neural networks to learn potential features. Therefore, this paper fuses CIDAE and O-NeuMF-p with reference to the idea of ensemble learning. We name the fused model the I-NMF (Imputation-Neural Matrix Factorization) model, which combines the advantages of CIDAE and O-NeuMF-p. The I-NMF model can not only alleviate the problem of data sparsity, but also fully exploit the ability of deep neural networks to learn potential features, and I-NMF is also robust. Figure 3 shows the structure of I-NMF.
It is worth noting that there was a previous paper [24] combining the AE algorithm and the MF algorithm. The paper [24] uses the AE's feature layer as the initial value of the hidden vector of MF, but the combined model is still linear, like MF.
The idea of this paper is not the combination of the feature layer, but purely the combination of the prediction layer, which guarantees the integrity of CIDAE and O-NeuMF-p, so that both models can play their own advantages.

Specific Fusion Process
In data set O, m is the total number of users, and n is the total number of items. Additionally, the data of O is divided into a training set X and test set T .
On the one hand, the input of CIDAE trained with X is u i (u i ∈ R n , i ∈ [1, m]), and the output is u i . On the other hand, the input of O-NeuMF-p trained with X is a group of user IDs and item IDs ( u i , v j (i ∈ [1, m], j ∈ [1, n])), and the output is the predicted valueŷ u i v j (ŷ u i v j ∈ (0, 1)).
For each user u i in O, taking the top-N recommendation as an example, our goal is to use only the data in X and recommend the top-N items that may be of interest in the missing data. The accuracy and quality of the top-N recommendation is calculated by a comparison with the data in T .
For each user u_i, firstly, we use the trained CIDAE to get the prediction vector u_i (u_i ∈ R^n) of the user u_i. Then, we use the trained O-NeuMF-p to get n predicted values ŷ_{u_i v_j} (j ∈ [1, n]) of the user u_i; the vector consisting of these n predicted values is denoted as y_i (y_i ∈ R^n). In this way, we have the prediction vectors u_i and y_i corresponding to these two models for each user u_i. The prediction vector for each user u_i in I-NMF is denoted as x_i (x_i ∈ R^n). The value of x_i is determined by the prediction vectors u_i and y_i, and the specific formula is as follows:

x_i = (1 − β) u_i + β y_i

where β is a hyper-parameter, and β ∈ [0, 1]. The introduction of β can adjust the fusion ratio of CIDAE and O-NeuMF-p. When β = 0, CIDAE is used for recommendation alone; conversely, when β = 1, O-NeuMF-p is used for recommendation alone. We give the recommended top-N items based on the ranking of the values of x_i over the missing data of the training set X. It is worth noting that the range of the values of x_i is not necessarily between 0 and 1, but this does not affect the top-N recommendation: since our goal is to recommend the top-N items, we only need to care about the ranking of the values of x_i.
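We read the fusion rule as a convex combination of the two models' prediction vectors, which matches the stated behavior at β = 0 and β = 1; a sketch:

```python
def fuse(u_cidae, y_neumf, beta):
    # I-NMF prediction vector: x = (1 - beta) * u + beta * y, per item
    return [(1 - beta) * a + beta * b for a, b in zip(u_cidae, y_neumf)]

x = fuse([0.2, 0.8], [0.6, 0.4], beta=0.5)
```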
Our experimental results show that the accuracy and quality of the top-N recommendation of I-NMF is higher than that of CIDAE and O-NeuMF-p, and is also better than some of the current advanced recommendation algorithms. This confirms from the side that I-NMF can combine the advantages of the two models, and I-NMF achieves good recommendation results by using the idea of ensemble learning.

Experiments
There are three sets of experiments in this paper: the experiment of CIDAE, the experiment of O-NeuMF-p, and the experiment of I-NMF. In this section, we answer the following questions through these three sets of experiments:
• Q1: Is CIDAE better than DAE and AE in the top-N recommendation?
• Q2: Is O-NeuMF-p better than NeuMF-p in terms of the accuracy and quality of the recommended results? What is the degree of optimization of NeuMF-p?
• Q3: Is I-NMF better than CIDAE and O-NeuMF-p in terms of the accuracy and quality of the recommended results?
• Q4: Is I-NMF better than some of the current advanced recommendation models?

Experimental Datasets
We evaluate all models using the public MovieLens (ML) [19] datasets. We specifically select the ML-100K dataset and the ML-1M dataset, which are commonly used for evaluating the performance of recommender systems [25,26]. The specific information of the two datasets is shown in Table 1. It is worth emphasizing that our experiments rely on very little of the datasets' information: specifically, we only use user IDs, item IDs, and ratings. Each rating in the two datasets is an integer from 1 to 5. Since this paper discusses the top-N recommendation, we only need to focus on whether the user is interested in the item. So, similar to [1,16], we convert each rating to 0 or 1, indicating whether the user has rated the item.

Performance Metrics
We use Hit Ratio (HR), Normalized Discounted Cumulative Gain (NDCG) [27], and Mean Average Precision (MAP) to evaluate the performance of models. The value of HR can directly reflect the accuracy of the ranking. The larger the HR value, the higher the accuracy of the ranking. NDCG can account for the position of the hit by assigning higher scores to hits at top ranks [1]. The larger the NDCG value, the higher the ranking quality [2]. Mean average precision calculates the mean of users' average precision. The larger the MAP value, the better the effect of the model.
For each model, we calculate HR@N (N = 5, 10, 15), NDCG@N (N = 5, 10, 15), and MAP@N (N = 5, 10, 15) for each test user and report the average score. It is worth noting that N can be any value; we set N to 5, 10, and 15 in order to show the recommendation performance of our model across a span of different values of N.
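For reference, HR@N and NDCG@N can be computed with their standard definitions as below; the paper's exact normalization may differ slightly, and `ranked` and `test_items` are hypothetical variable names.

```python
import math

def hr_at_n(ranked, test_items, n):
    # 1 if any held-out item appears in the top-N of the ranking, else 0
    return int(any(i in test_items for i in ranked[:n]))

def ndcg_at_n(ranked, test_items, n):
    # hits at top ranks score higher: gain 1/log2(pos + 2) at position pos
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, i in enumerate(ranked[:n]) if i in test_items)
    idcg = sum(1.0 / math.log2(pos + 2)
               for pos in range(min(len(test_items), n)))
    return dcg / idcg if idcg > 0 else 0.0

hr = hr_at_n([3, 7, 9], {7}, n=2)            # item 7 is in the top 2
ndcg_top = ndcg_at_n([7, 3, 9], {7}, n=10)   # hit at rank 1 -> ideal NDCG
```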

Experimental Method
To ensure the reliability of the experimental data, we introduce five-fold cross-validation. In five-fold cross-validation, the original sample is randomly partitioned into five equal sized subsamples. Of the five subsamples, a single subsample is retained as the validation data for testing the model, and the remaining four subsamples are used as training data. The cross-validation process is then repeated five times, with each of the five subsamples used exactly once as the validation data. The five results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.
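The five-fold procedure described above can be sketched as follows; the seed is arbitrary.

```python
import random

def five_fold_splits(n_samples, rng=None):
    # shuffle the sample indices once, cut them into five near-equal folds;
    # each fold serves as the validation set exactly once
    rng = rng or random.Random(42)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[k::5] for k in range(5)]
    for k in range(5):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

splits = list(five_fold_splits(10))
```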
We use five-fold cross-validation for all algorithms (including comparison algorithms) on the ML-100K dataset and the ML-1M dataset. Therefore, the experimental results of each algorithm are the average of five-fold cross-validation. It is worth noting that all our models' parameter experiments are performed only in a certain fold cross-validation process to avoid overfitting.
In order to prove that our algorithm is statistically better than other algorithms, rather than the result of experimental fluctuations, we compare the relative increase rate of the experimental results with the standard deviation of the five-fold cross-validation of our algorithm. For a more intuitive expression, we record the increase rate of the A algorithm relative to the B algorithm as A%B (specific calculation method is (A − B)/B). Furthermore, we record the standard deviation of the five-fold cross-validation of the A algorithm as A-SD. If the relative increase rate (A%B) is greater than the standard deviation (A-SD), our algorithm (A) is statistically superior to another algorithm (B).
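The A%B versus A-SD comparison can be sketched as below; the function name is ours, and the example scores are hypothetical.

```python
import statistics

def statistically_better(a_scores, b_mean):
    # A%B = (mean(A) - B) / B, compared against A-SD, the standard
    # deviation of A's five cross-validation scores
    a_mean = statistics.mean(a_scores)
    rate = (a_mean - b_mean) / b_mean
    sd = statistics.stdev(a_scores)
    return rate > sd, rate, sd

better, rate, sd = statistically_better([0.72, 0.71, 0.73, 0.72, 0.72], 0.60)
```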

Performance Comparison
Among the models used for comparison is a probabilistic approach that directly incorporates user exposure to items into collaborative filtering. The exposure is modeled as a latent variable, and the model infers its value from data.

Experiment of the CIDAE Model
In this part, firstly, we compare the CIDAE model with the DAE algorithm and the AE algorithm on the ML-100K dataset and the ML-1M dataset. Then, we analyze the influence of parameters of CIDAE on experimental results using the ML-100K dataset. Table 2 shows that the performance of the CIDAE model is better than the DAE algorithm and the AE algorithm on both datasets. Additionally, it can be found in Table 2 that CIDAE performs better on the ML-1M dataset. Compared with the AE algorithm, the advantage of CIDAE is that it can alleviate the problem of data sparsity by using the idea of imputation. Compared with the DAE algorithm, CIDAE does not add noise randomly, but performs regular continuous imputation on the output data. Therefore, the recommended ranking of CIDAE achieves a higher accuracy and better quality.
We also analyze the influence of the parameters of CIDAE on the experimental results on the ML-100K dataset. The specific parameters are as follows:
1. f(x): The imputation function;
2. k: The hyper-parameter of the imputation function;
3. N_i: Number of iterations;
4. d: Number of hidden nodes;
5. α: The loss weight of the imputed data.
Figure 4 uses NDCG@10 as an indicator to show the performance of the various parameters of CIDAE on the ML-100K dataset. Specifically, Figure 4a shows that different imputation functions f(x) have different experimental results, but the difference is not obvious; so, we finally chose f(x) = kx because its form is relatively simple. Conversely, the value of k can significantly affect the experimental results, and we finally chose f(x) = 1.4x because it works best. Figure 4b shows that when the number of iterations is 400, the accuracy of CIDAE converges. Figure 4c shows that CIDAE achieves the highest accuracy when the number of hidden nodes is 200. So, we finally chose N_i = 400 and d = 200. Figure 4d shows that when α = 0.1, CIDAE shows the highest accuracy. In addition, we can see that the accuracy is the worst when α = 1. This confirms that it is meaningful to calculate the loss of the imputed data separately with the weight α.


Experiment of the O-NeuMF-p Model
In this part, firstly, we compare O-NeuMF-p with NeuMF-p on the ML-100K dataset and the ML-1M dataset. Then, we quantify the degree of optimization of NeuMF-p. Finally, we analyze the influence of the parameter of O-NeuMF-p on the experimental results on both datasets. Table 3 shows that O-NeuMF-p performs better than NeuMF-p on both datasets. Furthermore, it can be found in Table 3 that the optimization effect is more obvious on the ML-1M dataset. The O-NeuMF-p model focuses more on the learning of counterexamples than the NeuMF-p model, and is more robust. Additionally, the experimental results confirm that introducing random numbers into the label y_uv in y^− is beneficial and improves the accuracy of NeuMF-p.
We also analyze the effect of the hyper-parameter r, which controls the range of the random numbers, on the accuracy of O-NeuMF-p. It is worth noting that when r = 0, the O-NeuMF-p model reduces to the NeuMF-p model. Figure 5a shows that O-NeuMF-p achieves the highest accuracy on the ML-100K dataset when r = 0.7. Figure 5b shows that O-NeuMF-p achieves the highest accuracy on the ML-1M dataset when r = 0.6. Furthermore, it can be seen that on the ML-1M dataset, the value of r has little effect on the value of NDCG@10. It is worth emphasizing that for all values of r, the NDCG@10 of O-NeuMF-p is higher than that of NeuMF-p on both the ML-100K and ML-1M datasets. This proves that O-NeuMF-p is more accurate than NeuMF-p on both datasets.
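The label randomization for negative instances can be sketched as follows. The exact distribution is an assumption on our part (a uniform draw on [0, r]), but the endpoints match the text: with r = 0 every negative label is exactly 0 and O-NeuMF-p reduces to NeuMF-p.

```python
import numpy as np

def randomized_negative_labels(num_negatives, r, rng=None):
    """Draw labels for sampled negative instances in y^-.

    NeuMF-p labels negatives with a hard 0; the optimization
    sketched here (our reading of O-NeuMF-p) replaces that 0 with
    a random number whose range is controlled by r.  When r = 0,
    the labels collapse back to 0, recovering NeuMF-p exactly.
    """
    rng = rng if rng is not None else np.random.default_rng()
    return rng.uniform(0.0, r, size=num_negatives)
```

Softening the negative labels this way keeps the model from being overconfident on sampled counterexamples, which is consistent with the robustness gain reported above.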

Experiment of the I-NMF Model
In this part, firstly, we test the performance of I-NMF and compare it with CIDAE and O-NeuMF-p on the ML-100K dataset and the ML-1M dataset. Then, we analyze the influence of the parameter of I-NMF on the experimental results on both datasets. Finally, we compare I-NMF with current advanced recommendation algorithms on both datasets; the experimental results prove that our I-NMF model performs better. In addition, we also report the running time of I-NMF and the comparison algorithms. Table 4 shows that I-NMF performs better than both CIDAE and O-NeuMF-p on both datasets. On the one hand, the imputation ability of CIDAE alleviates the sparseness of the original data; on the other hand, O-NeuMF-p uses the learning ability of deep neural networks to model the interaction between users and items. Moreover, both CIDAE and O-NeuMF-p have good robustness. It can be seen from the experimental results that CIDAE and O-NeuMF-p complement each other to achieve a higher accuracy. Additionally, it is worth noting that although CIDAE performs better on the ML-100K dataset, O-NeuMF-p performs better on the ML-1M dataset. This confirms the advantage of the powerful learning ability of deep neural networks on large datasets. In addition, this paper provides a new fusion idea: not only the combination of CIDAE and O-NeuMF-p, but, more generally, the combination of different types of recommendation algorithms.

We also analyze the effect of the hyper-parameter β controlling the fusion ratio on the accuracy of I-NMF. When β = 0, CIDAE is used for recommendation alone. Conversely, when β = 1, O-NeuMF-p is used for recommendation alone. Figure 6a shows that when β = 0.6, I-NMF achieves the highest accuracy on the ML-100K dataset. At this time, I-NMF is better than CIDAE and O-NeuMF-p. Figure 6b shows that when β = 0.8, I-NMF achieves the highest accuracy on the ML-1M dataset. It is worth emphasizing that for all values of β, I-NMF is better than CIDAE and O-NeuMF-p on the ML-1M dataset. This also confirms that CIDAE and O-NeuMF-p can complement each other very well.
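The β endpoints described above (β = 0 recommends with CIDAE alone, β = 1 with O-NeuMF-p alone) are consistent with a simple convex combination of the two models' predicted scores; the sketch below assumes that form, which is our reading rather than the paper's stated fusion rule.

```python
import numpy as np

def fuse_scores(score_cidae, score_oneumf, beta):
    """Convex combination of the two component models' item scores.

    beta = 0 -> CIDAE alone; beta = 1 -> O-NeuMF-p alone, matching
    the endpoints described in the text.  (The exact fusion rule of
    I-NMF is a convex-combination assumption here.)
    """
    return (1.0 - beta) * np.asarray(score_cidae) + beta * np.asarray(score_oneumf)

# Items are then ranked by the fused score, highest first:
# ranking = np.argsort(-fuse_scores(s_cidae, s_oneumf, beta))
```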

Finally, we compare I-NMF with current advanced recommendation algorithms on both datasets. Table 5 shows that I-NMF performs best on both datasets. On the ML-100K dataset, compared to the state-of-the-art ExpoMF method, I-NMF obtains average relative improvements of 6.8%, 6.9%, and 9.1% in the HR, NDCG, and MAP metrics, respectively. On the ML-1M dataset, compared to ExpoMF, I-NMF obtains average relative improvements of 5.4%, 5.3%, and 6.0% in the HR, NDCG, and MAP metrics, respectively. These results confirm the high accuracy of our I-NMF model.
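For reference, HR@K and NDCG@K in the leave-one-out protocol popularized by the NeuMF work are conventionally computed as below; this is a standard sketch, not the authors' exact evaluation code.

```python
import numpy as np

def hr_ndcg_at_k(ranked_items, heldout_item, k=10):
    """Leave-one-out HR@K and NDCG@K for a single user.

    HR@K is 1 if the held-out test item appears in the top-K
    ranked list; NDCG@K rewards higher positions p (0-based)
    with the discounted gain 1 / log2(p + 2).
    """
    topk = list(ranked_items[:k])
    if heldout_item in topk:
        pos = topk.index(heldout_item)
        return 1.0, 1.0 / np.log2(pos + 2)
    return 0.0, 0.0
```

The reported per-dataset numbers are these quantities averaged over all test users.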
In addition, we also report the running time (for one fold of the cross-validation process) of I-NMF and the comparison algorithms in Table 6. It is worth noting that our experiments were run on a CPU (Intel Core i7-6700K processor, 16 GB memory); running on a GPU may be faster.
Some parameters of our I-NMF model affect the running time, such as N_i (the number of iterations) and d (the number of hidden nodes) in CIDAE, and the number of layers of the MLP in O-NeuMF-p. However, the hyper-parameters β (controlling the fusion ratio), r (controlling the range of the random numbers), and α (the loss weight of the imputed data) do not affect the running time. The training time of our model is relatively long; during prediction, however, the running time is small, taking only 18.2 s for the ML-100K dataset and 2 min for the ML-1M dataset. In addition, we can optimize the training time in future work.

Conclusions and Future Work
Firstly, we propose a new CIDAE model based on DAE. The advantage of CIDAE is that it alleviates the problem of data sparsity by using the idea of imputation. Additionally, our experimental results show that CIDAE is more accurate than the AE algorithm and the DAE algorithm.
Then, we optimize the advanced NeuMF model by improving its training process. O-NeuMF-p focuses more on the learning of counterexamples than NeuMF-p, and is more robust. Furthermore, the experimental results confirm that introducing random numbers into the label y_uv in y^− is beneficial and improves the accuracy of NeuMF-p.
Finally, this paper fuses CIDAE and O-NeuMF-p with reference to the idea of ensemble learning. I-NMF can not only alleviate the problem of data sparsity, but also fully exploit the ability of deep neural networks to learn potential features, and I-NMF is also robust. It can be seen from the experimental results that CIDAE and O-NeuMF-p can complement each other to achieve a higher accuracy. Moreover, our experimental results prove that I-NMF performs better than current advanced recommendation algorithms on both datasets.
It is worth emphasizing that this paper provides a new fusion idea: not only the combination of CIDAE and O-NeuMF-p, but, more generally, the combination of different types of recommendation algorithms.
In the future, there are two directions to extend our work. On the one hand, we can try more ways to combine different kinds of recommendation algorithms to achieve a better accuracy. On the other hand, we can try to incorporate auxiliary extra data into recommender systems, such as social relations, review text, type of items, personal information of users, and so on. We can further improve the accuracy of the model by adding extra valuable data.