Metric Factorization with Item Cooccurrence for Recommendation

In modern recommender systems, matrix factorization has been widely used to decompose the user–item matrix into user and item latent factors. However, the inner product in matrix factorization does not satisfy the triangle inequality, and the problem of sparse data is also encountered. In this paper, we propose a novel recommendation model, namely, metric factorization with item cooccurrence for recommendation (MFIC), which uses the Euclidean distance to jointly decompose the user–item interaction matrix and the item–item cooccurrence matrix with shared latent factors. The item cooccurrence matrix is obtained from the colike matrix through the calculation of pointwise mutual information. The main contributions of this paper are as follows: (1) The MFIC model is not only suitable for rating prediction and item ranking, but can also mitigate the problem of sparse data. (2) The model incorporates the item cooccurrence matrix into metric learning so that it can better learn the spatial positions of users and items. (3) Extensive experiments on a number of real-world datasets show that the proposed method substantially outperforms the compared algorithms in both rating prediction and item ranking.


Introduction
In the era of information overload, recommendation systems are important for addressing the problem of information explosion. Collaborative filtering is an early, widely used, and influential recommendation technology, but its performance is severely degraded by data sparseness [1][2][3]. In the past decade, factorizing the user-item matrix into user and item latent factor vectors has been widely studied, and matrix factorization models have become a popular approach. Matrix factorization not only has high prediction accuracy but can also integrate and decompose additional side information. However, its performance is affected by the choice of the inner product [4][5][6] because the inner product does not satisfy the triangle inequality: "the distance between two points cannot be larger than the sum of their distances from a third point" [7]. This limits the expressive power of matrix factorization and can lead to locally optimal solutions, thereby reducing the flexibility and generalization performance of the model. Metric learning based on the Euclidean distance is more suitable for learning latent factors [8][9][10]. Metric factorization learns user and item factors on the basis of the Euclidean distance; to do so, the user-item interaction matrix must be encoded as a distance matrix that is suitable for Euclidean distance learning. Although metric learning overcomes these shortcomings of matrix factorization, learning only the latent factors of users and items remains insufficient when the data are sparse. Therefore, we propose the metric factorization with item cooccurrence model, in which the item cooccurrence matrix is introduced into the metric learning process so that the user-item interaction matrix and the item-item

Matrix Factorization
In recommendation systems, matrix factorization is a popular and effective method and has become the standard in modern recommendation systems. The successful implementation of many latent factor models is based on matrix factorization. The most basic matrix factorization strategy is to decompose the rating matrix into latent factors of users and items. By learning the relationship between users and items, the accuracy of recommendation prediction is improved [11,12]. With the development of recommendation systems, many variants of matrix factorization have been derived; for example, user and item bias terms have been introduced to improve the prediction accuracy of the model [13], and probabilistic graphical models have been introduced to better adapt to sparse real-world data [14,15]. The authors in [16] proposed a novel quality of service prediction approach based on probabilistic matrix factorization, which can incorporate network location and implicit associations among users and services. Mature algorithms also include SVD++ [17] and timeSVD [18]. Another approach addresses item ranking from a Bayesian perspective by pushing the unobserved user-item pairs away from the observed user-item pairs [19].

Item Embedding
The item cooccurrence matrix that was developed in this paper was inspired by the word embedding model. In the word embedding model, each word is represented by a real-valued vector [20]. Word2vec is a popular word embedding method. For a specified sequence of training words, its embedding model learns the latent factors of each word. For example, in [21], the surrounding words of a specified word are predicted during training. The study in [22] used the word embedding model to build item embedding models for prediction, and [23] introduced item embedding into the matrix factorization model, which substantially improved the performance. Reference [24] added not only user and item embeddings but also the items that users do not like into the matrix factorization model, which also improved the prediction accuracy. Therefore, we introduce item cooccurrence in this paper: we process the user-item matrix to construct the item cooccurrence matrix, which better facilitates metric learning.

Metric Factorization with Item Cooccurrence (MFIC) Model
First, we review the two basic frameworks on which the MFIC model is built: metric factorization for recommendation beyond matrix factorization (FML) and word embedding. Then, we describe the MFIC model and how it is computed.

Factorized Metric Learning (FML) Model
The FML model is a metric learning model that uses the Euclidean distance. First, the user rating matrix $R \in \mathbb{R}^{m \times n}$ is transformed into a distance matrix $R_1 \in \mathbb{R}^{m \times n}$; the distance matrix is obtained via Equation (1).
Max Similarity is the maximum value of the rating matrix (e.g., 5) or of the implicit feedback (e.g., 1). In the metric vector space, we denote the positions of users and items as $P_u \in \mathbb{R}^{k}$ and $Q_i \in \mathbb{R}^{k}$, respectively. The main optimization loss functions of FML are as follows.
In rating prediction, Equation (3) was selected as the prediction distance, where $b_u$ and $b_i$ represent the user and item biases, respectively, and $\mu$ represents the global bias. The hyperparameter $\tau$ is placed in front of $\mu$ to scale it and obtain a more accurate prediction value. $c_{ui}$ is a self-confidence mechanism that ensures that extreme ratings are assigned higher confidence values. Equation (4) was selected as the prediction distance when ranking items; there, $c_{ui}$ is the self-confidence mechanism of the observed items.
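The conversion and prediction steps described above can be illustrated with a minimal sketch. Equations (1) and (3) are not reproduced in this text, so the forms below are assumptions: the distance is taken to be Max Similarity minus the rating, and the predicted distance is taken to be the squared Euclidean distance between the two positions plus the bias terms, with $\tau$ scaling $\mu$. The function names are hypothetical.

```python
def rating_to_distance(rating, max_similarity=5.0):
    """Assumed form of Equation (1): a high rating becomes a short distance."""
    return max_similarity - rating


def predicted_distance(p_u, q_i, b_u, b_i, mu, tau):
    """Assumed form of Equation (3): squared Euclidean distance plus biases.

    p_u and q_i are the k-dimensional positions of the user and the item;
    tau scales the global bias mu, as described in the text.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(p_u, q_i))
    return sq_dist + b_u + b_i + tau * mu


# A 5-star rating maps to distance 0, and a 1-star rating to distance 4.
print(rating_to_distance(5.0), rating_to_distance(1.0))
print(predicted_distance([0.0, 0.0], [1.0, 1.0], 0.1, 0.2, 1.0, 0.5))
```

Under these assumptions, a user who rated an item highly is pulled close to that item in the metric space, while low ratings push the two positions apart.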

Word Embedding
The word embedding model has realized substantial success in natural language processing and has received increasing attention. Word embedding is a generalization of language modeling and representation learning techniques in natural language processing that maps each word or phrase from a high-dimensional space to a real-valued vector in a low-dimensional vector space. In the popular word2vec [20], a set of words is specified and each word is embedded from a high-dimensional vector into a low-dimensional vector space. The skip-gram model in word2vec is then used to predict the words that surround an input word within a fixed window. For example, as shown in Figure 1, we selected the word "my" as the input word and set skip_window = 2, which means that the two words to the left and the two words to the right of "my" enter the window, yielding four groups of training data.
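The windowing step can be sketched as follows. This is only an illustration: the sentence is hypothetical (it is not the sentence of Figure 1), and `skipgram_pairs` is a name introduced here, not part of word2vec.

```python
def skipgram_pairs(tokens, skip_window=2):
    """Return (input word, context word) pairs for a symmetric window."""
    pairs = []
    for center, word in enumerate(tokens):
        lo = max(0, center - skip_window)
        hi = min(len(tokens), center + skip_window + 1)
        for ctx in range(lo, hi):
            if ctx != center:  # a word is not its own context
                pairs.append((word, tokens[ctx]))
    return pairs


# Hypothetical sentence in which "my" has two neighbours on each side.
sentence = ["i", "love", "my", "new", "phone"]
my_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "my"]
print(my_pairs)
```

With skip_window = 2, the input word "my" contributes four training pairs, one for each neighbour inside the window.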

According to [25,26], the skip-gram model with negative sampling is equivalent to the implicit factorization of a word-context matrix in which each entry is the pointwise mutual information (PMI) of the corresponding word and context, shifted by a global constant. Let $D$ be the set of observed word-context pairs. The PMI between word $i$ and its context word $j$ is calculated as
$$\mathrm{PMI}(i, j) = \log \frac{P(i, j)}{P(i)\,P(j)} \qquad (5)$$
where $P(i, j)$ denotes the probability that words $i$ and $j$ appear together in a fixed window, $P(i)$ is the probability of occurrence of word $i$ in $D$, and $P(j)$ is the probability of occurrence of word $j$ in $D$. These probabilities are estimated from frequencies in Equations (6)-(8), where #() denotes a frequency count; for example, #(i, j) is the number of times that words i and j occur together. Substituting Equations (6)-(8) into Equation (5) yields Equation (9). The PMI values can be arranged as a matrix of size m × n, namely, $M^{\mathrm{PMI}}$, where m is the number of elements in set D. Next, the shifted positive pointwise mutual information (SPPMI) of words i and j is calculated via Equation (10). Here, k is a hyperparameter that controls the density of the resulting matrix: the larger the value of k, the sparser the matrix. The main advantage is that optimization adjustments are unnecessary. The above is the complete process of word embedding.
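Equations (6)-(10) are not reproduced in this text. As an assumption, the commonly used count-based forms, which are consistent with Equation (5) and with the description above, would read as follows, with $|D|$ denoting the number of observed pairs:

```latex
\begin{align}
P(i, j) &= \frac{\#(i, j)}{|D|} \tag{6} \\
P(i)    &= \frac{\#(i)}{|D|} \tag{7} \\
P(j)    &= \frac{\#(j)}{|D|} \tag{8} \\
\mathrm{PMI}(i, j) &= \log \frac{\#(i, j)\,|D|}{\#(i)\,\#(j)} \tag{9} \\
\mathrm{SPPMI}(i, j) &= \max\bigl(\mathrm{PMI}(i, j) - \log k,\; 0\bigr) \tag{10}
\end{align}
```

In this form, a larger k subtracts a larger shift before the values are clipped at zero, so more entries become zero and the matrix becomes sparser, which matches the behaviour described for k.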

MFIC Model
Inspired by word embedding, we construct a colike item matrix. The items consumed by a user can be treated like the words in a word embedding model; therefore, the colike item matrix can be built in the same way and used to learn the item latent factors. The colike item matrix is symmetric about its diagonal. As illustrated in Figure 2, I1 and I2 are both consumed by U1, U3, and U4 in the user-item matrix; therefore, the entry for I1 and I2 in the colike matrix is 3. I1 and I4 have not been consumed by the same user; hence, the corresponding entry is empty. This matrix is generated and merged with metric learning in this paper. The rating matrix is used to find the items consumed by each user, which plays the role of #(i) and #(j) in word embedding, and to find the items consumed together by the same users, which plays the role of #(i, j). Before the item cooccurrence SPPMI matrix is constructed, the pointwise mutual information of each item pair must be calculated via Equation (5). Then, the shifted positive pointwise mutual information of each item-item pair is calculated via Equation (10). According to the colike item matrix in Figure 2, #(I1, I2) = 3, #(I1) = 3, #(I2) = 2, and |D| = 8; the resulting mutual information value of I1 and I2 is 0.60, as presented in the item embedding matrix of Figure 3. Finally, the item cooccurrence matrix is embedded into the metric learning model to highlight the item's expressiveness and to strengthen the relationships between users and items and between items and items.
To use the metric vector space to learn user and item positions via factorization, it is necessary to convert the user rating matrix into a distance matrix to improve the learning in the metric space. The distance matrix can be constructed from the explicit distance matrix and the implicit distance matrix. A checkmark in the explicit matrix indicates that the user has consumed the item, and a cross indicates that the item has not been consumed. The explicit distance matrix is obtained by a transformation rule in which Max Similarity is the maximum value in the rating matrix (e.g., 5). In the implicit case, Similarity(u, i) is 1 or 0, and the parameter β in the transformation controls the balance between the user and the item.
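The colike counting and the worked PMI value described in this section can be sketched as follows. The interaction data are hypothetical (they are not the matrix of Figure 2, only chosen so that I1 and I2 share three users and I1 and I4 share none), and the base-10 logarithm is an assumption, adopted because it reproduces the reported value of 0.60. The SPPMI function uses the common shifted form, which is likewise an assumption.

```python
import math
from itertools import combinations


def colike_counts(user_items):
    """Count, for each unordered item pair, how many users consumed both."""
    counts = {}
    for items in user_items.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def pmi(n_ij, n_i, n_j, total, log=math.log10):
    """PMI from frequency counts; total plays the role of |D|."""
    return log(n_ij * total / (n_i * n_j))


def sppmi(pmi_value, k, log=math.log10):
    """Shifted positive PMI (assumed form): subtract log k, clip at zero."""
    return max(pmi_value - log(k), 0.0)


# Hypothetical interactions: I1 and I2 are both consumed by U1, U3, and U4.
user_items = {
    "U1": {"I1", "I2"},
    "U2": {"I3", "I4"},
    "U3": {"I1", "I2", "I3"},
    "U4": {"I1", "I2"},
}
counts = colike_counts(user_items)
print(counts[("I1", "I2")])           # colike entry for I1 and I2
print(round(pmi(3, 3, 2, 8), 2))      # counts quoted in the text
```

Pairs that never co-occur, such as I1 and I4 here, are simply absent from the dictionary, which corresponds to the empty entries of the colike matrix.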


Evaluation for Rating Prediction
The MFIC model combines metric factorization and item cooccurrence and simultaneously learns the positions of the latent factors of users and items in the metric space. The difference between the two components is that metric factorization infers a representation that encodes a user's preference for an item, whereas the item embedding must be interpreted from the item cooccurrence model. The overall model of the MFIC rating prediction is illustrated in Figure 3.
Equation (13) is the objective function of the MFIC model, in which $\alpha$ is the weight coefficient that balances the $Y_1$ and $Y_2$ losses so that the model finds the optimal value faster during training. The last term in the equation is the regularization term, and $\lambda$ is the regularization parameter. The norms are constrained so that $\|P_u\| < c$, $\|Q_i\| < c$, and $\|Q_{i1}\| < c$, which confines $P_u$, $Q_i$, and $Q_{i1}$ to a ball of radius $c$ in the L2 norm so that the data points are spread less widely and high-dimensional spaces are easier to handle. Equation (14) expresses the learning of the spatial positions of users and items with the Euclidean distance in the metric space, where the positions of the user and the item are denoted as $P_u \in \mathbb{R}^{k}$ and $Q_i \in \mathbb{R}^{k}$. Equation (15) represents the predicted rating value, which is generated from the user and the item and from the item and the embedded item, and it strengthens the connection between the user and the item. $\gamma$ is a hyperparameter that controls the balance between the user-item term and the item-embedded-item term. As in matrix factorization [13], some items are popular and easily obtain high ratings, while some users habitually assign low ratings. Therefore, similar to matrix factorization, biases are added to metric learning to improve the stability and expressiveness of the model. $b_u$ and $b_i$ represent the user bias and the item bias, respectively. $\mu$ is a global bias, which can be multiplied by a hyperparameter to improve the performance of the model. Equation (16) predicts the newly added item embedding. The prediction between the item and the embedded item can highlight the expressiveness of the item, and $b_{i1}$ is the bias of the embedded item. Equation (17) is a self-confidence mechanism that assigns a high degree of confidence to extreme ratings [27].
$g(\cdot)$ can be an absolute value function, a square function, or a logarithmic function, selected according to the requirements of the model. $\theta$ is a scaling parameter of the self-confidence mechanism that controls the degree of confidence in a rating.
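Two of the ingredients described above, the self-confidence weight and the norm constraint, can be sketched under stated assumptions. Equation (17) is not reproduced in this text; the form $c_{ui} = 1 + \theta\, g(R_{ui} - R_{max}/2)$ used below is an assumption in the style of FML-like models, and both function names are hypothetical. The clipping function rescales a vector so that its norm is at most c, which is a non-strict version of the constraint stated in the text.

```python
import math


def confidence(rating, theta, max_rating=5.0, g=abs):
    """Assumed form of Equation (17): extreme ratings get larger weights."""
    return 1.0 + theta * g(rating - max_rating / 2.0)


def clip_norm(vec, c):
    """Rescale vec so that its L2 norm does not exceed the clip value c."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= c or norm == 0.0:
        return list(vec)
    return [x * c / norm for x in vec]


# With theta = 0.2, a 5-star rating is weighted more than a 3-star rating.
print(confidence(5.0, 0.2), confidence(3.0, 0.2))
print(clip_norm([3.0, 4.0], 1.0))
```

Passing a different `g`, for example `lambda x: x * x`, switches between the absolute value and square variants mentioned in the text.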

Evaluation for Ranking Prediction
Similarly, the item cooccurrence is introduced into the item ranking model to improve the item ranking performance of the personalized recommendation system. In personalized item ranking, processing implicit data outperforms predicting from explicit data. In previous studies, binary processing is typically used for implicit data [28][29][30]. Therefore, the explicit data were also converted into implicit data in this paper. For example, a rating that is greater than or equal to 3.5 is represented as 1, and a rating that is less than 3.5 is represented as 0. The setting of the rating threshold is elaborated in the experimental part.
Symmetry 2020, 12, 512
Equation (18) is used in the same way as in rating prediction. The main difference lies in the term $c_{ui}$ of Equation (19): $c_{ui}$ is still a self-confidence mechanism in item ranking, and $\theta$ is a scaling hyperparameter, but $W_{ui}$ now represents the number of times the user has given positive feedback on the item. For example, if the user consumes the item three times, $W_{ui} = 3$. This helps users to be placed closer to their favorite items and farther away from the items that they do not like. In addition, the embedding of the item not only highlights the item's expressiveness but also increases the connections between users and items.
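The binarization rule and the ranking confidence can be sketched as follows. The threshold of 3.5 comes from the text; the form $c_{ui} = 1 + \theta W_{ui}$ is an assumption, since Equation (19) is not reproduced here, and the function names are hypothetical.

```python
def binarize(rating, threshold=3.5):
    """Map an explicit rating to implicit feedback: 1 if >= threshold, else 0."""
    return 1 if rating >= threshold else 0


def ranking_confidence(w_ui, theta):
    """Assumed form of Equation (19): weight grows with positive feedback count."""
    return 1.0 + theta * w_ui


ratings = [5.0, 3.5, 3.0, 1.0]
print([binarize(r) for r in ratings])
print(ranking_confidence(3, 0.1))  # a user who consumed the item three times
```

Under this assumed form, an item that a user has consumed repeatedly contributes more to the loss, so its position is fitted more tightly to that user.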

Optimization and Prediction
Dropout was added to the training of the MFIC model. Dropout is an effective method for mitigating overfitting in neural networks [31]; therefore, this paper used dropout to prevent overfitting when learning the user and item latent factors with the Euclidean distance. In addition, the loss function was optimized with Adagrad [32], which adapts the learning step size of each parameter according to how frequently it is updated and thereby reduces the need for manual tuning of the learning rate. Finally, because the rating matrix is converted into a distance matrix at the beginning of the model, the predicted distances must be converted back into ratings for rating prediction; for item ranking, the closeness of an item to a user is determined directly by the predicted distance.
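The adaptive step size of Adagrad can be illustrated with a minimal sketch of a single update on a toy quadratic loss. This is the generic Adagrad rule, not the authors' training code; the function name, the toy loss, and the choice of 0.02 as the learning rate (mirroring the value of r reported later) are assumptions made for illustration.

```python
import math


def adagrad_step(params, grads, accum, lr=0.02, eps=1e-8):
    """One in-place Adagrad update.

    accum holds the running sum of squared gradients per parameter, so a
    parameter that has received large or frequent gradients takes smaller steps.
    """
    for idx, g in enumerate(grads):
        accum[idx] += g * g
        params[idx] -= lr * g / (math.sqrt(accum[idx]) + eps)
    return params, accum


# Toy loss f(x) = (x - 3)^2 with gradient 2 * (x - 3), starting from x = 0.
params, accum = [0.0], [0.0]
grads = [2.0 * (params[0] - 3.0)]
adagrad_step(params, grads, accum)
print(params, accum)
```

On the first step the gradient is divided by its own magnitude, so the parameter moves by approximately the learning rate toward the minimum regardless of the gradient scale.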

Experimental Evaluation
We studied the performance of the MFIC model in rating prediction and item ranking. Using several datasets and evaluation indicators, we determined the impact of the model parameters on the performance, compared the MFIC model with other recommendation methods, and analyzed the experimental results.

Preparation for the Rating Prediction Experiments and Presentation of the Experimental Result
The datasets for rating prediction are Movielens-100K and Movielens-1M [27]. During the experiment, the datasets were randomly divided into training sets and test sets according to the ratio of 9:1. The sparsity is the number of existing ratings divided by the number of users and the number of items. Details on the datasets are presented in Table 1. The selection of important parameters of the MFIC model has a substantial influence on the prediction performance. The training order of the parameters is r, c, N, τ, θ, d, γ, λ, α, s, and k. First, all parameters were set on the basis of the FML model27. Second, inspired by the parameter settings section in [33], one of the parameters was trained and a set of values was selected for iterative training. , and the pointwise mutual information value k was selected from [0, 1,3,5,7,9]. Finally, the next parameter was trained on the basis of the optimal parameters. For example, when the training parameter was c, the remaining parameters were fixed and r was set to the optimal value that had been trained. The stopping criterion is that the evaluation indicator root mean squared error (RMSE) was not further reduced. We obtained the optimal setting as follows: Movielens-100K (r = 0.02, c = 1.0, N = 150, τ = 0.8, θ = 0.2, d = 0.05, γ = 0.002, λ = 0.01, α = 0.01, s > 1, and k ≥ 1). In the same way, the optimal setting is as follows: Movielens-1M (r = 0.02, c = 1.4, N = 150, τ = 0.5, θ = 0.1, d = 0.03, γ = 0.002, λ = 0.01, α = 0.01, s > 3.5, and k ≥ 0.5). Figure 6 plots the effects of only the important parameters on the performance of the MFIC model during training on Movielens-100K. Figure 6a presents the tuning of the learning rate to the model parameters. After continuous training, when the learning rate was r = 0.02, the predicted rating error was the smallest in terms of mean average error (MAE) or root mean squared error (RMSE). The clip value is a range of spatial positions that control the user and item latent factors. 
A suitable clip value can effectively address spatial multidimensional problems. According to Figure 6b, c = 1.0 is the most suitable. The number of dimensions N determines the number of latent vectors; too large a value will increase the complexity of the model, whereas too small a value will reduce the expressiveness of the features. Therefore, according to Figure 6c, when N = 150, the performance of the model is optimal. The parameter τ can be regarded as a scaling super-parameter of the global bias term that enables the model to find the accurate prediction value. From Figure 6d, it can be concluded that the super-parameter τ facilitates the improvement of the model performance and the prediction error was reduced when τ = 0.8. Figure 6e shows that when θ = 0.2, the rating prediction error was the smallest. When θ = 0, the performance of the model was drastically reduced. The confidence value θ is indispensable in this model. Dropout was utilized to prevent latent user and item factors from being overfit during training. According to Figure 6f, the value that was selected by the dropout rate was too large or too small to help the model. Therefore, d = 0.05 was selected. The weight loss of the prediction function was used to balance the contributions of users and items, and items and embedded items to the rating prediction (see Equation (14)). Figure 6g shows that γ = 0.002 was the best choice for the model. The regularization term λ was added to prevent overfitting of the model. According to Figure 6h, λ = 0.01 should be set. Inspired by [34], this paper also added the loss function weight α to the model to control each loss function term. According to Figure 6i, α = 0.01 should be selected. The selection of important parameters of the MFIC model has a substantial influence on the prediction performance. The training order of the parameters is r, c, N, τ, θ, d, γ, λ, α, s, and k. First, all parameters were set on the basis of the FML model27. 
Second, inspired by the parameter settings section in [33], one of the parameters was trained and a set of values was selected for iterative training.  1,2,3,4], and the pointwise mutual information value k was selected from [0, 1,3,5,7,9]. Finally, the next parameter was trained on the basis of the optimal parameters. For example, when the training parameter was c, the remaining parameters were fixed and r was set to the optimal value that had been trained. The stopping criterion is that the evaluation indicator root mean squared error (RMSE) was not further reduced. We obtained the optimal setting as follows: Movielens-100K (r = 0.02, c = 1.0, N = 150, τ = 0.8, θ = 0.2, d = 0.05, γ = 0.002, λ = 0.01, α = 0.01, s > 1, and k ≥ 1). In the same way, the optimal setting is as follows: Movielens-1M (r = 0.02, c = 1.4, N = 150, τ = 0.5, θ = 0.1, d = 0.03, γ = 0.002, λ = 0.01, α = 0.01, s > 3.5, and k ≥ 0.5). Figure 6 plots the effects of only the important parameters on the performance of the MFIC model during training on Movielens-100K. The rating threshold s was used to select data when constructing the item cooccurrence matrix such as s > 1, which caused all ratings that exceeded 1 in the dataset to be selected, namely, only the data with ratings of 2, 3, 4, and 5 were selected for item cooccurrence. In the construction of the matrix, the ratings that have little effect on the model are removed, and the complexity of the model is reduced. Figure 7a shows that when the selection rating threshold satisfies s > 1, the error of the rating prediction was the smallest and the error of the previous rating prediction was reduced. The pointwise mutual information value was calculated according to Equation (8). 
According to Equation (9), k is the setting choice of the pointwise mutual information value in the item cooccurrence matrix; for example, if k ≥ 3, the culling pointwise mutual information value is in the range of 0 ≤ k < 3, and the model complexity can be reduced again. According to Figure 7b, a pointwise mutual information value that satisfies k ≥ 3 should be selected. To make the experimental results more accurate and representative, this paper used two evaluation indicators (MAE and RMSE) in the rating prediction, conducted six evaluations for each comparison method, and averaged the results. To evaluate the performance of MFIC in rating prediction, this paper considered the following comparison algorithms: bayesian probabilistic matrix factorization (BPMF) is a probabilistic matrix factorization model with a full Bayesian processing method that uses the Markov chain Monte Carlo method to train the model [15]. Neural rating regression (NRR) is a prediction rating model that is based on the deep learning framework neural rating tips (NRT) [35]. Neural network matrix factorization (NNMF)uses a multilayer neural network to change the inner product of matrix factorization to realize rating prediction [36]. FML uses the Euclidean distance instead of the inner product of matrix factorization for metric space learning and rating prediction.
Table 2 shows that the MFIC model outperformed all the comparison algorithms in rating prediction and achieved satisfactory prediction performance on both Movielens-1M and Movielens-100K, which differ in density. In addition, BPMF achieved better MAE and RMSE values than NRR on Movielens-1M, whereas the opposite was true on Movielens-100K. MFIC achieved good results on both datasets, indicating that it performs relatively stably on datasets with different densities. Thus, the introduction of item cooccurrence can improve the rating prediction performance.


Item Ranking Experiment Preparation and Experimental Results
Two datasets, FilmTrust and EachMovie [27], were used for item ranking. Each dataset was divided into a training set and a test set at a ratio of 8:2. Details on the datasets are presented in Table 3.

Table 3. Details of the datasets FilmTrust and EachMovie.

All parameters were tuned in the same way as for the rating prediction in Section 4.1, and training of the model for item ranking was stopped when the evaluation indicator Precision@5 showed no further improvement. The tuning results are shown in Figure 8. In Figure 9a, the value that corresponds to each column is the result of one round of tuning. Each value is an evaluation indicator; Figure 9b and Figure 8 are read in the same way. Additional information is presented in Figure 9.
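The one-parameter-at-a-time tuning referred to here can be sketched as below. This is only an illustration under stated assumptions: `evaluate` is a hypothetical callable that trains a model with the given settings and returns the validation indicator (for example, Precision@5), and the toy objective at the end stands in for real training.

```python
def tune_one_at_a_time(evaluate, grids, initial, higher_is_better=True):
    """Tune each parameter in turn while the others stay fixed.

    evaluate: hypothetical function mapping a settings dict to a score.
    grids: dict of parameter name -> list of candidate values.
    initial: dict of starting values for every parameter.
    """
    best = dict(initial)
    best_score = evaluate(best)
    for name, candidates in grids.items():
        for value in candidates:
            trial = {**best, name: value}   # change one parameter only
            score = evaluate(trial)
            improved = score > best_score if higher_is_better else score < best_score
            if improved:
                best, best_score = trial, score
    return best, best_score

# Toy objective standing in for "train, then measure Precision@5".
def toy_evaluate(p):
    return -((p["theta"] - 0.7) ** 2) - (p["beta"] - 2.5) ** 2

best, score = tune_one_at_a_time(
    toy_evaluate,
    grids={"theta": [0.1, 0.3, 0.5, 0.7, 0.9], "beta": [0.5, 1.5, 2.5, 3.5]},
    initial={"theta": 0.1, "beta": 0.5},
)
```

For an error indicator such as RMSE, `higher_is_better=False` would be passed instead. This kind of search is cheap but can miss interactions between parameters, since each one is tuned only against the current best values of the others.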

In Figure 8, the optimal values for each parameter were θ = 0.7, c = 1, r = 0.1, N = 100, α = 0.01, d = 0.05, and γ = 0.005. The meanings of the hyperparameters θ, c, r, N, α, d, and γ were the same as in the rating prediction and will not be explained here. The distance scaling hyperparameter β controls the minimum distance that is set for the items that are negative for the user. As shown in Figure 8d, β = 2.5 was the optimal value for ranking the items.
The rating threshold s and the pointwise mutual information value k in Figure 9 have the same meanings as in the rating prediction and will not be explained here. As shown in Figure 9, the performance of the model was optimal when s > 2.5 and k ≥ 5.
To accurately evaluate the item ranking performance of the MFIC model, this paper used several evaluation indicators: the mean average precision (MAP), the mean reciprocal rank (MRR), the normalized discounted cumulative gain (NDCG), Recall@n, and Precision@n. Each algorithm was trained six times and the results were averaged to yield more representative experimental results. To evaluate the performance of the MFIC model, the following comparison algorithms were considered. The neural matrix factorization (NeuMF) model combines a multilayer perceptron with matrix factorization for item ranking [4]. Collaborative denoising auto-encoders (CDAE) is a model with more flexible components that is based on denoising autoencoders [37]. Weighted regularized matrix factorization (WRMF) is an item ranking model with positive and negative preferences [2]. FML uses the Euclidean distance instead of the inner product of matrix factorization to learn the metric space and item rankings [27].
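The ranking indicators listed above can be computed per user as in the following sketch, which assumes binary relevance (an item is either in the user's test set or not). MAP and MRR are then the means of the per-user average precision and reciprocal rank. The function names are illustrative, and normalizing average precision by the number of relevant items is one common convention that may differ from the one used in the experiments.

```python
import math

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n recommended items that are relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def recall_at_n(ranked, relevant, n):
    """Fraction of the relevant items that appear in the top n."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:n] if item in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a hit occurs."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if there is none."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_n(ranked, relevant, n):
    """DCG of the top n divided by the DCG of an ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:n], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), n) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Toy ranking for one user: the items at ranks 2 and 4 are relevant.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "d"}
```

For this toy ranking, Precision@5 is 2/5, Recall@5 is 1, and both the average precision and the reciprocal rank are 1/2.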
According to Table 4, the MFIC model also outperformed all comparison algorithms in item ranking, even on two datasets that differed in terms of density, namely, FilmTrust and EachMovie. Hence, MFIC can adapt to the environment of sparse data in the item ranking prediction. According to the experimental results, the metric learning used by MFIC far outperformed methods that use matrix factorization algorithms such as NeuMF and WRMF. In addition, the FML algorithm with metric learning can also be used for item ranking, but its prediction result was less accurate than the MFIC prediction result. Thus, the introduction of item cooccurrence can also substantially facilitate item ranking.


Conclusions
In this paper, we proposed a metric factorization method with item cooccurrence for recommendation, which borrows the word embedding strategy from natural language processing to combine item cooccurrence with metric factorization learning in a recommendation system. First, the item cooccurrence matrix was constructed via the calculation of the pointwise mutual information value. Then, the user-item matrix and the item cooccurrence matrix were converted into the corresponding distance matrices, and the obtained distance matrices were decomposed in a metric space via the Euclidean distance to learn the spatial positions of the latent user, item, and embedded item factors. Finally, the performance in rating prediction and item ranking was evaluated. The experimental results on two datasets for rating prediction and item ranking show that the MFIC model substantially outperforms the other recommendation algorithms.