Implicit Stochastic Gradient Descent Method for Cross-Domain Recommendation System

Previous recommendation systems applied the matrix factorization collaborative filtering (MFCF) technique only to single domains. Due to data sparsity, this approach struggles to overcome the cold-start problem. Thus, in this study, we focus on discovering latent features across domains to understand the relationships between them (called domain coherence). This approach uses latent knowledge of the source domain to improve the quality of target-domain recommendations. In this paper, we consider applying MFCF to multiple domains. Specifically, by adopting the implicit stochastic gradient descent algorithm to optimize the objective function for prediction, multiple matrices from different domains are consolidated inside the cross-domain recommendation system (CDRS). Additionally, we design a conceptual framework for CDRS that applies to different industrial scenarios for recommenders across domains. Moreover, an experiment is devised to validate the proposed method. Using real-world datasets gathered from Amazon Food and MovieLens, experimental results show that the proposed method improves on other methods by 15.2% in computation time and 19.7% in MSE on a utility matrix. Notably, a much lower convergence value of the loss function was obtained in the experiment. Furthermore, a critical analysis of the results shows that there is a dynamic balance between prediction accuracy and computational complexity.


Introduction
Recent achievements in Internet and computing technologies have made it possible for organizations to collect, store, and process large amounts of data. These data contain detailed information related to the behaviors of users; specifically, they represent the set of user evaluations for specific items. For example, Amazon (https://www.amazon.com/) collects information about users' shopping habits, and even about their browsing on its website. Netflix (https://www.netflix.com/) likewise has substantial data related to movies. These data are beneficial for recommending useful decisions when supporting clients. In this scenario, each firm designs a unique and maximally efficient system that can recommend items as pleasing as possible to its customers [1]. Nevertheless, not all users give ratings for items that they like or dislike. This limitation causes fragmentation of the dataset obtained from users, which is called data sparsity. In real life, datasets are sparse, with only around 0.05% of ratings observed [2]. Therefore, the system cannot produce useful recommendations when a new user or item enters the system, due to insufficient previous ratings. This problem is called the cold-start problem [3], and it is the most challenging issue for researchers. The main contributions of this paper are as follows:
• We propose an efficient framework for a cross-domain recommendation system based on a constrained optimization model. In our model, the optimal solution and the computation time are taken into consideration simultaneously.
• We devise an approximation algorithm suitable for optimizing the objective function in a cross-domain setting. In particular, an implicit updating technique is applied to improve convergence time.
• We conduct extensive experiments on two real-world datasets to validate the effectiveness and efficiency of our method. The results demonstrate that the proposed framework achieves better performance than the previous approach.
The remainder of the paper is organized as follows. Section 2 explains the background knowledge of MFCF in a single domain and reviews literature related to CDRS. Section 3 formally defines the problem formulation. In Section 4, we present our conceptual framework for CDRS. Section 5 presents the experiments. Finally, we draw conclusions and suggest directions for future study in Section 6.

Related Work
Recent researchers have studied cross-domain related work, as mentioned in [11,16,17], wherein there are two types of cross-domain recommendation tasks. The first task is to use the information of the source domain to enhance the quality of the target-domain recommendation [18][19][20]. Karatzoglou et al. used a machine learning method to transfer dense knowledge from the source domain to the target domains, which are much sparser [21]. Enrich et al. used user tags as connections between multiple domains, from which they learn the users' rating models to gain performance in the target domain [22]. The second task is recommending items in separate domains concurrently; the authors of [23,24] proposed a method for creating a rating matrix from a shared multi-domain latent factor. Shi et al. [25] used user-generated tags to calculate the similarity between cross-domain users and items, respectively, and then integrated these similarities into a matrix factorization model to improve recommendation accuracy. Gao et al. presented a clustering latent factor model based on a joint non-negative matrix framework [26].
For recommendation using matrix factorization, work was done by Gogna et al. [27], who proposed a matrix completion framework that can be implemented in different domains. Zhenzhen et al. presented a cross-domain recommendation algorithm to overcome the cold-start and sparsity problems, and mentioned that it could be extended to consider temporal dynamics, as user preferences may change over time [28]. The cross-domain collaborative framework for venue recommendation proposed by Farseev et al. [29] is not able to address the cold-start problem. Loni et al. [30] presented a cross-domain factorization machine that can exploit additional knowledge from an auxiliary domain by encoding domain-specific knowledge as a real-valued feature vector.
In this study, we apply MFCF to multiple domains using an implicit updating technique to reduce the convergence time of the objective function. Additionally, an implicit stochastic gradient descent-based algorithm is applied to the cross-domain recommendation system.

Background
In a single domain, suppose there are M users and N items. The relationship between the users and the items is represented by the user-item rating matrix Y ∈ R^{M×N}, called the utility matrix. Any rating r_ij in Y satisfies r_ij ∈ {1, 2, 3, 4, 5, ?}, where "?" represents a missing value. To predict the missing values, users and items are clustered. The utility matrix Y can be factorized into two matrices, Y ≈ Ŷ = XW^T, where X ∈ R^{M×K} is the user-group membership matrix and W ∈ R^{N×K} is the item-group membership matrix. Figure 1 illustrates the matrix factorization, in which the full utility matrix Y is decomposed into the two matrices X and W, where K is much smaller than M and N. Each row in X represents a user profile x, and each row in W denotes an item profile w; that is, the i-th item and the j-th user are represented by the rows W_{i*} and X_{j*} of the two matrices, respectively. After matrix factorization, the users and items are mapped to a latent feature space of lower dimensionality K. To predict the missing values in the utility matrix, the low-rank matrix factorization is posed as the optimization problem

minimize_{X,W} L(f(X, W), Y) + λ R(X, W), (1)

where L is the loss function between the predicted ratings f(X, W) and the original ratings Y, R(X, W) is the regularization term, and λ is the regularization trade-off parameter; λR(X, W) is the regularization component that avoids overfitting. In probabilistic matrix factorization (PMF) [31,32], the objective function measuring the loss with regularization terms under a Frobenius norm is expressed as

L = (1/2) ‖I ⊙ (Y − XW^T)‖_F^2 + (λ/2)(‖X‖_F^2 + ‖W‖_F^2), (2)

where I is the rating indicator matrix with I_ij ∈ {0, 1}: I_ij = 1 indicates that the rating is observed, and I_ij = 0 otherwise; ⊙ denotes the Hadamard product [33] of the matrices.
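As a concrete illustration of the masked objective in (2), the following sketch factorizes a toy utility matrix with plain gradient steps. The matrix, rank K, step size, and iteration count are invented for this example and are not the paper's experimental settings.

```python
import numpy as np

# Toy utility matrix; 0 stands for the unobserved "?" entries.
Y = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])
I = (Y > 0).astype(float)     # rating indicator matrix I_ij
M, N = Y.shape
K, lam = 2, 0.1               # latent dimension K << M, N; trade-off lambda

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(M, K))   # user-group membership matrix
W = rng.normal(scale=0.1, size=(N, K))   # item-group membership matrix

def loss(X, W):
    E = I * (Y - X @ W.T)                # Hadamard product with the indicator
    return 0.5 * np.sum(E ** 2) + 0.5 * lam * (np.sum(X ** 2) + np.sum(W ** 2))

loss0 = loss(X, W)
for _ in range(200):                     # plain gradient steps on (2)
    E = I * (Y - X @ W.T)
    gX = E @ W - lam * X                 # compute both gradients first,
    gW = E.T @ X - lam * W               # then update simultaneously
    X += 0.01 * gX
    W += 0.01 * gW
```

Only the observed entries (where I_ij = 1) contribute to the error term, which is exactly what the indicator mask in (2) encodes.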

Definition of User-Preference Matrix
Let D_l, with l ∈ {1, ..., L}, be the user-preference matrix corresponding to the l-th domain. The entries of D_l, denoted by (D_l)_ij, indicate the rating of the i-th user for the j-th item of domain l.
By U we denote the set of all users that exist across the multiple domains,

U = U_{D_1} ∪ U_{D_2} ∪ ... ∪ U_{D_L},

where U_{D_l} is the set of users in the l-th domain. Whether these matrices are overlapping or non-overlapping, the matrix V, built by consolidating the matrices D_1, D_2, ..., D_L, has |U| rows. Given a user U_u, u ∈ {1, 2, ..., |U|}, in U, each matrix D_l can be rewritten in terms of the row vectors d_u^X, X ∈ {D_1, D_2, ..., D_L}, where d_u^X contains the corresponding rating values of user u for all items in domain X. For generality, all the users' ratings can then be described by the matrix V whose u-th row is the concatenation [d_u^{D_1}, d_u^{D_2}, ..., d_u^{D_L}]. Matrix V is expandable, since its dimensionality increases when new items and users are added to the data. We denote the transpose of V by V^T.
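As a toy sketch of this consolidation (the user names and ratings below are invented for illustration), two domain matrices with partially overlapping users can be merged into V row by row, with missing ratings left as 0:

```python
import numpy as np

# Two hypothetical domains with partially overlapping users.
D1 = {"alice": [5, 3], "bob": [4, 1]}        # domain 1: 2 items
D2 = {"bob": [2, 5, 1], "carol": [3, 0, 4]}  # domain 2: 3 items

users = sorted(set(D1) | set(D2))            # U = U_D1 ∪ U_D2, |U| = 3
V = np.zeros((len(users), 2 + 3))            # |U| rows, items of both domains
for i, u in enumerate(users):
    if u in D1:
        V[i, :2] = D1[u]                     # row block d_u^{D1}
    if u in D2:
        V[i, 2:] = D2[u]                     # row block d_u^{D2}
# V is expandable: a new user or item just adds a row or a column.
```

Here "bob" appears in both domains, so his row carries ratings from both; "alice" and "carol" each contribute a partially filled row.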

A column vector b = [b_1, b_2, ..., b_n]^T is given, where n denotes the number of rows (the number of users) of V. Each entry b_i is the inverse of the square root of the element a_ii on the diagonal of VV^T:

b_i = 1 / √a_ii.

By b^T we denote the transpose of b, and we set B = bb^T. In this regard, a similarity matrix S can be written as

S = (VV^T) ⊙ B, (9)

where ⊙ is the Hadamard product: each element (p, q) of S is the product of the elements (p, q) of the two matrices VV^T and B. After this operation, S is a symmetric matrix whose rows and columns are indexed by users, in which each element S_ij is the cosine of the angle between the two vectors u_i and u_j, where u_i and u_j are the i-th and j-th rows of V, respectively.

Remark 1. Considering the preference vectors u_i and u_j, i ≠ j, the similarity between the i-th and j-th users is properly given by S_ij; i.e.,

S_ij = u_i u_j^T / (‖u_i‖ ‖u_j‖),

since the (i, j) entry of VV^T equals u_i u_j^T, and the corresponding entry B_ij of B equals 1 / (‖u_i‖ ‖u_j‖). Therefore, (9) is a generalized formulation for deriving the similarity matrix among users with respect to all items.
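Numerically, the construction S = (VV^T) ⊙ B reduces to the following sketch; the small V is hypothetical, chosen so that users 0 and 2 have identical preference vectors and hence cosine similarity 1:

```python
import numpy as np

# Hypothetical consolidated matrix V (3 users, 3 items).
V = np.array([[5., 0., 3.],
              [4., 1., 0.],
              [5., 0., 3.]])

G = V @ V.T                        # (VV^T)_ij = u_i u_j^T
b = 1.0 / np.sqrt(np.diag(G))      # b_i = 1 / sqrt(a_ii)
B = np.outer(b, b)                 # B = b b^T
S = G * B                          # Hadamard product: S_ij = cos(u_i, u_j)
# Users 0 and 2 have identical preference vectors, so S[0, 2] == 1.
```

The diagonal of S is all ones (every user is perfectly similar to itself), and S is symmetric, as stated above.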
Similarly, we can find the item-similarity matrix as follows.
S′ = (V^T V) ⊙ C,

where C = cc^T and c = [c_1, c_2, ..., c_m] is a row vector whose entry c_m is the inverse of the square root of the element d_mm on the diagonal of V^T V:

c_m = 1 / √d_mm.

Now we factorize the matrix V. As mentioned in the previous section, the rating that user n gives to item m can be approximated as y_mn = x_m^T w_n. However, actual ratings carry biases for users and/or items, since users tend to rate items according to their own rating behaviors, yielding ratings that may be larger or smaller than the values the items deserve. We use bias terms to overcome this problem. By µ_m and µ_n we denote the biases of item m and user n, respectively. The rating is then approximated by

y_mn = x_m^T w_n + µ + µ_m + µ_n, (14)

where µ is the median value of all ratings. Therefore, the loss function (2) can be written as

L = (1/2) Σ_{m,n} I_mn (y_mn − x_m^T w_n − µ − µ_m − µ_n)^2 + (λ/2)(‖X‖_F^2 + ‖W‖_F^2). (15)

In previous works, this loss function is solved by optimizing one of the pairs (X, µ_m) and (W, µ_n) while fixing the other, repeating the process until the loss converges. This push-pull [34] gradient method yields a sub-optimal solution [35]. In contrast with these earlier investigations, we solve this loss function by optimizing (X, W, µ_m, µ_n) simultaneously.
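A minimal sketch of the bias-corrected prediction in (14); all profile and bias values below are invented for illustration:

```python
import numpy as np

# Hypothetical learned quantities: latent profiles and bias terms.
mu = 3.5                                  # central value of all ratings
mu_item = np.array([0.3, -0.2])           # item biases mu_m
mu_user = np.array([0.1, -0.4, 0.2])      # user biases mu_n
W = np.array([[0.8, 0.1], [0.4, 0.7]])    # item profiles (2 items, K = 2)
X = np.array([[1.0, 0.2], [0.5, -0.1], [0.3, 0.9]])  # user profiles (3 users)

def predict(m, n):
    """y_mn = profile inner product + mu + mu_m + mu_n, as in (14)."""
    return X[n] @ W[m] + mu + mu_item[m] + mu_user[n]
```

The bias terms shift the raw inner product toward each user's and item's typical rating level, which is exactly the correction the text describes.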

Algorithm for Prediction Error Minimization
In this section, we investigate the joint design problem for minimizing the prediction error, stated in (16), where t ≜ [t_mn], ∀m ∈ {1, ..., M}, ∀n ∈ {1, ..., N}, with each t_mn satisfying constraint (17). Although the objective function in (16) is quadratic, and hence convex, constraint (17) is non-convex. To solve the problem efficiently, we derive a successive convex program based on an inner approximation method [36] as follows. It can be observed that (17) is equivalent to the convex constraint (18), with the additional constraint (19) imposed. However, constraint (19) is still non-convex. Inspired by ([37], Lemma 1), (19) can be approximated by (20), which is convex as a second-order cone constraint; here, x̄_mk and w̄_nk are the values of x_mk and w_nk at the previous iteration, respectively. Therefore, the successive convex program is formulated as

minimize over (X, W, µ_m, µ_n, t, u) the objective of (16), subject to (18) and (20). (21)
The problem in (21) can be efficiently solved per iteration by existing solvers (e.g., SDPT3 [38], MOSEK [39], or SeDuMi [40]), so that we obtain at least a locally optimal solution at convergence. The procedure is briefly described in Algorithm 1: at each iteration, (21) is solved to obtain (x*, µ_m*, w*, µ_n*) and the loss value L^(k+1).

In Algorithm 1, we use the implicit update technique to increase the convergence speed. The initial values of x and w are chosen at random. For practical implementation, Algorithm 1 terminates after a finite number of iterations, upon reaching L^(k+1) − L^(k) < ε [41].
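Since Algorithm 1's full listing is not reproduced here, the following hedged sketch only illustrates the implicit-update idea itself on a simple squared loss: the gradient is evaluated at the *next* iterate, which for this loss admits a closed form and stays stable at step sizes where the explicit update could diverge. This is an illustration of the technique, not the paper's algorithm.

```python
import numpy as np

# Implicit SGD for l(theta) = 0.5 * (y - x.theta)^2: solve
#   theta_next = theta + gamma * (y - x.theta_next) * x
# for theta_next in closed form.
def implicit_step(theta, x, y, gamma):
    shrink = gamma / (1.0 + gamma * (x @ x))   # implicit step is auto-damped
    return theta + shrink * (y - x @ theta) * x

rng = np.random.default_rng(1)
theta_star = np.array([2.0, -1.0])             # ground truth to recover
theta = np.zeros(2)
for _ in range(500):
    x = rng.normal(size=2)
    y = x @ theta_star                         # noise-free sample
    theta = implicit_step(theta, x, y, gamma=0.5)
```

The damping factor 1/(1 + γ‖x‖²) arises from solving the implicit equation rather than being tuned by hand, which is why implicit updates tolerate larger learning rates.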

CDRS Framework
In this section, we propose a conceptual framework for a cross-domain recommender system that applies the proposed method [42]. When businesses launch multiple products or services, a massive amount of data must be processed to make recommendations to clients. These data are heterogeneous and imbalanced, since their sources are different domains. Data from users, such as ratings, numbers of likes, and website browsing history, are collected, clustered, and stored in the database. The cross-domain recommendation system engine processes these data to build the model. A set of parameters can be adjusted at this stage to obtain the best accuracy. The system output is the user-preference prediction, which is used to recommend items to customers.
In particular, according to Figure 2, multiple datasets from various domains are preprocessed in a knowledge transfer module. Here, similarities are identified, latent features are extracted, and knowledge transformation is performed. In the next phase, the prediction model analyzes all the information exported from the preprocessing phase in order to apply the appropriate algorithms for training and prediction generation. In this phase, most parameters are tuned repeatedly to choose the best set for maximizing the whole system's accuracy. With this workflow, a CDRS can deal with heterogeneous input data and produce recommended items in various scenarios.

Experiments
In this section, we report experiments conducted to evaluate the recommendation quality of the proposed recommendation model against several state-of-the-art baseline recommendation techniques.

Dataset
To better illustrate our method, this section outlines a small-scale example. Two datasets were used: MovieLens (https://movielens.org/) and Amazon Food (https://www.kaggle.com/snap/amazon-fine-food-reviews). The statistical information for these datasets is presented in Table 1. As shown in Table 1, the MovieLens 100K dataset includes 943 users with 90,570 ratings for 1675 items. It is therefore extremely sparse, with a density of only about 0.057 (5.7% of the possible user-item pairs rated). Similarly, the Amazon Food dataset is sparse, with a density of around 0.058. This sparsity is natural in real-world recommendation services [43]. The remaining unknown ratings are a big challenge for the recommender system to predict.
We chose three other related algorithms to compare with the proposed algorithm:
• The rating-matrix generative model (RMGM) [23], one of the most popular algorithms for testing cross-domain recommendation performance.
• The singular value decomposition-based MF (SVD) [44].
• The SVD++-based MF (SVD++) [45] is an extension of the SVD considering implicit ratings.
For each algorithm, optimization was performed with both gradient descent and implicit stochastic gradient descent.

Evaluation Metric
We adopt the mean square error (MSE) to measure the accuracy of the predicted ratings; it measures the average squared distance between the target ratings and the predicted values:

MSE = (1/n) Σ_{i=1}^{n} (y_i − y_i^p)^2.

Additionally, we use the mean absolute error (MAE), which has frequently been used to compare the prediction errors of recommendation methods:

MAE = (1/n) Σ_{i=1}^{n} |y_i − y_i^p|,

where n denotes the number of tested ratings, y_i is the real rating, and y_i^p is the predicted rating. This approach is used because the predicted rating values create an ordering across the items, so predictive accuracy can also be used to measure the ability of a recommendation system to rank items with respect to user preference [46].
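As code, the two metrics are simply the following (the toy numbers in the comments are for illustration only):

```python
import numpy as np

# MSE and MAE over n tested ratings y_i and predictions y_i^p.
def mse(y, y_pred):
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return float(np.mean((y - y_pred) ** 2))

def mae(y, y_pred):
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y - y_pred)))
```

MSE penalizes large errors quadratically, while MAE weights all errors linearly, which is why the two metrics can rank methods differently.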
We use k-fold cross-validation to split the dataset: the dataset is split into k sections (folds), and each fold is used as the testing set at some point. Selecting a proper k is important, since a poorly chosen value may misrepresent the methods. In this experiment, k is set to 10, because k = 5 and k = 10 have been empirically shown to yield test-error-rate estimates that suffer neither from excessively high bias nor from very high variance [47]. Here, the dataset is split into ten folds. In the first iteration, the first fold is used to test the model and the rest are used to train it; in the second iteration, the second fold is the testing set, while the rest serve as the training set; and so on, until each of the ten folds has been used as the testing set. Repeating the process k times yields k mean square errors, MSE_1, MSE_2, ..., MSE_k, and the k-fold cross-validation error is computed by averaging the MSE over the k folds.
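The 10-fold protocol above can be sketched as follows; the trivial train-mean predictor stands in for the actual model and is an assumption of this example only:

```python
import numpy as np

def kfold_mse(ratings, k=10, seed=0):
    """Average MSE over k folds, following the protocol described above."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(ratings)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = ratings[train].mean()            # placeholder "model"
        scores.append(np.mean((ratings[test] - pred) ** 2))
    return float(np.mean(scores))               # (MSE_1 + ... + MSE_k) / k
```

Swapping the placeholder for a real train/predict pair gives the evaluation loop used to compare the methods in this section.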

Baseline
A matrix factorization method is applied to solve the problem in (15); ultimately, we have to optimize the loss function L. An optimization method based on the gradient descent algorithm is used for this purpose. Notably, the four variables are separated into two pairs: at each iteration, one pair is kept constant while the other is optimized [48]. This process repeats until convergence is achieved, following the push-pull gradient scheme. After convergence, a sub-optimal solution is obtained, and this solution is used as the baseline.
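For intuition, the pair-alternating idea behind the baseline can be sketched as below, where each half-step is a closed-form ridge solve; biases and the missing-data mask are omitted purely for brevity, and the toy matrix is invented:

```python
import numpy as np

# Toy fully-observed rating matrix; lam is the regularization trade-off.
Y = np.array([[5., 3., 1.], [4., 1., 1.], [1., 5., 4.]])
K, lam = 2, 0.1
rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(3, K))
W = rng.normal(scale=0.1, size=(3, K))

def loss(X, W):
    return 0.5 * np.sum((Y - X @ W.T) ** 2) + \
           0.5 * lam * (np.sum(X ** 2) + np.sum(W ** 2))

loss0 = loss(X, W)
for _ in range(30):
    # Fix W, solve the ridge problem for X; then fix X and solve for W.
    X = Y @ W @ np.linalg.inv(W.T @ W + lam * np.eye(K))
    W = Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(K))
```

Each half-step decreases the regularized loss, but because the two blocks are never optimized jointly, the scheme can stall at a sub-optimal point, which is the weakness the simultaneous update in this paper targets.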

Experiment Parameters
Two optimization methods are used for comparison: gradient descent (GD) and implicit stochastic gradient descent (ISGD) [49,50]. The set of parameters is presented in Table 2. We have chosen these parameters based on a series of empirical tests.

Evaluation and Discussions
Now we solve the problem in this paper by optimizing all the variables simultaneously, applying the implicit stochastic gradient descent (ISGD) method. Firstly, it is necessary to transform the original problem (15) into the convex problem formulated in (21) [51]. The parameters listed in Table 2 are the same as in the baseline case. To deal with the vast number of variables, techniques for accelerating the convergence rate are required; Algorithm 1 shows the updating step at each iteration. Figure 3 shows the typical convergence behavior of the algorithms for the loss-minimization problem. ISGD needs only a few iterations to reach its convergence value; moreover, that value is much lower than the baseline's.
When K varies from 10 to 40, as shown in Figure 3, the slope of the ISGD convergence curve changes accordingly: the larger K is, the steeper the slope, and the larger the initial value of the objective function. The results show that the larger the selected K, the higher the objective value obtained at the first iteration. This is because a larger K increases the dimensions of x and w, so the number of elements in x and w grows and the norm terms become larger, making the objective value larger; however, the convergence value remains approximately the same. Fixing K at 10, the results of varying λ from 0.01 to 0.1 are shown in Figure 4, which illustrates the difference in convergence rate as the regularization parameter λ changes. When λ is small, the objective value at the first iteration is small and the convergence rate is slow; with a larger λ, a higher objective value is obtained at the first iteration and the convergence value is reached faster. When K is increased (e.g., to 20, 30, or 40) and λ is set to its highest value (0.1), the initial objective value is much larger, since it is affected by both factors, and the convergence value is reached faster.
We observe that the parameter K adjusts the approximation: it plays the role of the approximation dimension, and the bigger K is, the more accurate the approximation. Nevertheless, as K increases, the value of the objective function increases accordingly, since the objective penalizes the norms of x and w. This leads to a trade-off between the MSE and the computational complexity: K cannot be too large, yet the MSE has to be as small as possible. Regarding convergence time, we measured the time until convergence for GD and ISGD; the results are shown in Table 3. In Figure 5, we can see that the proposed method is efficient in reducing computation time: on average, computation time is reduced by 15.2%. Furthermore, the proposed method shows a significant improvement in prediction accuracy. Table 4 and Figure 6 show an MAE comparison between our method and the other techniques; our method outperforms the comparison methods on all tests. That is, the experimental results show that using the ISGD technique to optimize the objective function in MFCF improves the performance of the cross-domain recommendation system. When K varies from 10 to 40, the implicit update technique shows its efficiency in reducing the convergence time. Unfortunately, the objective function contains the norms of x and w, which leads to a trade-off between the MSE and the computation time. Since our goal is to make the MSE as small as possible, K cannot be too large. This is a limitation of our approach: we have to balance the accuracy of the recommender system against the computation time.

Conclusions and Future Works
In this paper, we proposed a new method to consolidate multiple matrices from multiple domains for building a cross-domain recommendation system. After the consolidation, the matrix was factorized using MFCF. The problem was to maximize the accuracy of predicting users' unknown ratings. To address the design problem, we transformed the original problem into sub-problems of lower dimension, and then proposed an iterative algorithm based on the inner approximation method to solve the resulting sequence of convex programs. We applied the implicit stochastic gradient descent method for the implicit update at each iteration. With realistic parameters, our method monotonically improved the objective function, and convergence to a stationary point is guaranteed. Through the experiments, we demonstrated the usefulness of our approach in improving the accuracy of the CDRS.
As future work, we plan to use multiple datasets with different distributions and attributes to test the performance of the cross-domain recommendation system. In this way, we can investigate the appropriate set of parameters for each specific type of data, or type of domain in general.