Leveraging User Comments for Recommendation in E-Commerce

Abstract: Collaborative filtering recommender systems traditionally recommend products to users solely based on the given user-item rating matrix. Two main issues, data sparsity and scalability, have long been concerns. In our previous work, an approach was proposed to address the scalability issue by clustering the products using the content of the user-item rating matrix. However, it still suffers from these concerns. In this paper, we improve the approach by employing user comments to address the issues of data sparsity and scalability. Word2Vec is applied to produce item vectors, one item vector for each product, from the comments made by users on their previously bought goods. Through the user-item rating matrix, the user vectors of all the customers are produced. By clustering, products and users are partitioned into item groups and user groups, respectively. Based on these groups, recommendations to a user can be made. Experimental results show that both the inaccuracy caused by a sparse user-item rating matrix and the inefficiency due to an enormous amount of data can be much alleviated.


Introduction
Recommender systems [1][2][3] are able to analyze the past behavior of customers and recommend the products in which they might be interested. Recommender systems can roughly be categorized into two types: collaborative filtering and content-based filtering. Content-based filtering [4][5][6] assumes that customers will buy things that are similar to what they have bought in the past. Therefore, detailed information about products and users is required for recommendations. However, the information needed is growing harder to get in the modern age of privacy awareness. Collaborative filtering [7][8][9][10][11][12][13][14][15][16][17] assumes that similar users have similar interests in items and similar items receive similar ratings by users. Traditionally, users give their preference ratings for the previously purchased products, and these ratings are maintained in a user-item rating matrix. Based on the content of the matrix, collaborative filtering makes recommendations to a user according to the opinions of other like-minded users on the products. In general, collaborative filtering is simpler and more practical, and it tends to be more appealing in the E-commerce community.
However, there exist some issues with collaborative filtering. Two of them are data sparsity and scalability [18][19][20]. Data sparsity is related to the sparse ratings in the user-item rating matrix, and it can lead to inaccurate recommendations. Scalability is related to the huge number of products or/and users involved, which may cause an unacceptably long delay before valuable recommendations are acquired. In this paper, we propose a novel approach to deal with data sparsity and scalability. First of all, we apply Word2Vec [21,22], which is a word semantic tool developed by Google, to analyze the user comments on their previously bought goods. Each word is assigned a unique vector that represents its semantics. A set of item vectors, one item vector for each product, is then developed. Through the user-item rating matrix, the user vectors of all the users are obtained. Principal component analysis and an iterative self-constructing clustering algorithm are applied to reduce the time complexity related to the large numbers of items and users. Then, recommendation work is done with the resulting clusters. Finally, reverse transformation is performed, and a ranked list of recommended items can be offered to each user. With the proposed approach, the inaccuracy caused by the sparse ratings in the user-item rating matrix is overcome, and the processing time for making recommendations from an enormous amount of data is much reduced.
The proposed approach is an extension of our previous work published in [23]. Some major advantages over the previous work are provided. Word2Vec is applied on user comments to find the similarity relationship among products. Therefore, the sparsity problem is alleviated, and the accuracy of recommendation is improved. An iterative self-constructing clustering algorithm is used. The clusters obtained are less dependent on the order of the training patterns presented to the clustering algorithm, leading to more robust recommended results. Finally, clustering is applied not only on the products but also on the users. Consequently, the scalability problem can be further alleviated.
The rest of this paper is organized as follows. In Section 2, collaborative filtering, particularly our previous work [23], is briefly introduced. Some other related work is reviewed and discussed in Section 3. The proposed approach is described in detail in Section 4. Experimental results are presented in Section 5. Finally, a conclusion is given in Section 6.

Collaborative Filtering
Suppose there is a set of N users, u_i, 1 ≤ i ≤ N, and a set of M products, p_j, 1 ≤ j ≤ M. A user u_i may express his/her evaluation of a product p_j by providing a rating r_ij, which is a positive integer, for p_j. Usually, a higher rating is assumed to indicate a more favorable feedback from the user. If user u_i has not provided a rating for product p_j, r_ij = 0. Such information can be represented by the user-item rating matrix R = [r_ij], 1 ≤ i ≤ N, 1 ≤ j ≤ M, which is an N × M matrix. Traditionally, collaborative filtering recommends products to each user solely based on such a user-item rating matrix. The idea is simple, but issues of data sparsity and scalability may arise.
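As a minimal illustration of this representation, a rating matrix can be assembled from (user, item, rating) triples, with unrated pairs left at 0. The helper name below is ours, not from the paper:

```python
import numpy as np

def build_rating_matrix(triples, n_users, n_items):
    """Assemble the N x M user-item rating matrix R from rating triples."""
    R = np.zeros((n_users, n_items))
    for i, j, r in triples:      # i: user index, j: item index, r: positive rating
        R[i, j] = r              # unrated (i, j) pairs stay 0
    return R

# Toy example: 3 users, 4 items, three observed ratings.
R = build_rating_matrix([(0, 0, 5), (0, 2, 3), (1, 2, 4)], n_users=3, n_items=4)
```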
In [23], we proposed an approach to address the problem of scalability. Each product p_i, 1 ≤ i ≤ M, is represented as an N-vector, which is the ith column in R. A self-constructing clustering algorithm is applied to divide the products into a collection of g clusters C_1, C_2, · · · , C_g, with g ≤ M. Each cluster is regarded as one item group. Therefore, g item groups, denoted I_1, I_2, · · · , I_g, respectively, are obtained. Then an M × g transformation matrix T is formed, with the (i, j)th entry t_ij denoting the membership degree of product p_i to item group I_j, 1 ≤ i ≤ M, 1 ≤ j ≤ g.
Next, the original user-item rating matrix R is transformed to a reduced matrix B by B = RT, which is an N × g matrix. Based on B, a g × g correlation matrix is constructed. Then, ItemRank is applied, and a predicted preference list of item groups, h_i, is iteratively derived for each user u_i. Since recommendation is done with B, which is of size N × g, rather than with R, which is of size N × M, efficiency is improved. Finally, the reverse transformation is performed, giving s_i = T h_i, which is the predicted preference list of products for user u_i, 1 ≤ i ≤ N. However, collaborative filtering recommender systems may encounter difficulty due to sparsity. Consider Table 1, which is a small user-item rating matrix consisting of 7 users and 7 items. Clearly, there are many missing ratings in R.

Table 1. A small user-item rating matrix R.
Note that, because of sparsity, the rows of R are hardly similar to each other, so recommendations based on the similarity between users will not be accurate at all. Likewise, the columns of R are hardly similar to each other, so it is impossible to reduce the number of items by clustering techniques based on the similarity between items, and equally impossible to reduce the number of users based on the similarity between users. Therefore, data sparsity makes it hard in this case for collaborative filtering to produce accurate and efficient recommendations through the similarity between users or/and items.

Related Work
Various methods have been proposed to address the issues of data sparsity and scalability associated with collaborative filtering. In matrix factorization methods [24,25], an M × N user-item rating matrix is decomposed into two matrices of size M × K and K × N, respectively. In this way, K groups are obtained. Bobadilla et al. [18] use a Bayesian non-negative matrix factorization method and apply K-means to get better parameters for matrix factorization. These methods have a drawback in common: when encountering a heavily sparse user-item rating matrix, the groups obtained are not reliable. As a result, the resulting recommendations are not accurate and may not be useful. Zhang et al. [26] introduce the concepts of popular items and frequent raters. They assume that the ratings match some probability model. To overcome data sparsity and rating diversity, smoothing and fusion techniques are employed. However, the method may encounter difficulties with a user-item rating matrix having a high degree of sparsity.
Clustering-based methods [2,23,[27][28][29][30][31][32][33] have been proposed to address the scalability issue. Users or/and items are clustered into groups, and thus the numbers of users or/and items are reduced. Park [34] proposes a method that separates tail items, having only a few ratings, from head items, having a sufficient number of ratings, according to their popularities. The recommendations for tail items are based on the ratings of clustered groups, while the recommendations for head items are based on the ratings of individual items or groups clustered to a lesser extent. However, this method tends to recommend tail items as a whole to users. Das et al. [30] use the DBSCAN clustering algorithm to cluster the users and then implement voting algorithms to recommend items to each user depending on the cluster to which the user belongs. The idea is to partition the users and apply the recommendation algorithm separately to each partition. Allahbakhsh and Ignjatovic [35] propose an iterative method that regards all the users as one user. The ratings by individual users are integrated together according to each user's credibility, which is iteratively updated. This can reduce the sparsity difficulty. However, it cannot give personalized recommendations to each user. In our previous work [23], we apply a self-constructing clustering algorithm to cluster products into item groups. However, the users are not clustered, which may lead to the scalability problem when a huge number of users are involved.
Incorporating information from sources other than the user-item rating matrix may help to overcome the sparsity problem [6,20,[36][37][38]. Victor et al. [39] consider the trust and distrust relationship between users when doing rating prediction by collaborative filtering. The information about the trust and distrust relationship is gathered from the "useful" or "not useful" tags in the user text comments on items. If a user gives a useful tag to another user's comments, a positive relationship is considered to exist between them. Forsati et al. [40] also consider the trust and distrust relationship between users, taken from social networks, to mitigate the sparsity issue. They improve the matrix factorization method by adding a condition to the optimization function in which the trust and distrust information is embedded. Huang et al. [41] apply the user-item rating matrix, user social networks, and item features extracted from the DBpedia knowledge base to cluster users and items into multiple groups simultaneously. By merging the predictions from each cluster, the top-n recommendations to the target users are returned. Zheng et al. [42] propose a matrix factorization method that takes advantage of user comments on items. The comments are converted to item vectors with elements corresponding to predefined keywords. If a keyword appears in the comments about an item, the corresponding element of the item vector is set to 1; otherwise, it is set to 0. One difficulty of this method is that the keywords have to be manually defined. Barkan et al. [43] use Word2Vec to analyze the metadata of items and construct item vectors. Then, items are clustered, and the top-k similar items can be found. However, for most of these methods, accessing user relationships from social networks is getting harder as privacy awareness soars nowadays. Besides, the information obtained indirectly through friends in social networks may not always be relevant.
For instance, one may have a friend who has a really different lifestyle, contradicting the basic assumption for the trust relationship.

Proposed Approach
Suppose, from some other sources, we know that p 1 , p 2 , p 3 , and p 4 in Table 1 are similar to each other, and so are p 5 , p 6 , and p 7 . Then, we can somehow cluster these seven products into two item groups, say, p g 1 and p g 2 , and transform the matrix to something similar to the one shown in Table 2. Assume that the ratings in this table are obtained by taking the maximum of the original product ratings. Clearly, there exist two groups of similar users, namely, {u 1 , u 2 , u 3 , u 4 } and {u 5 , u 6 , u 7 }, and recommendations through the similarity between users can be done by collaborative filtering. Table 2. The user-item rating matrix after a clustering of products.
Motivated by this idea, we propose a novel approach to overcome the inaccuracy and inefficiency caused by data sparsity and scalability. Our approach consists of five steps: developing item and user vectors, grouping for items and users, getting a reduced user-item rating matrix, calculating group rating scores, and deriving individual preference lists.
Step 1 aims to overcome the sparsity problem by exploiting extra information from user comments. Item and user vectors are derived.
Step 2 aims to deal with the scalability problem. In this step, products and users are clustered into item groups and user groups, respectively. As a result, the numbers of products and users are greatly reduced. In step 3, the original user-item rating matrix is converted to a reduced user-item rating matrix that involves item groups and user groups. In step 4, a series of random walks are executed on the reduced user-item rating matrix, and a preference list of item groups is derived for each user group. Since low numbers of item groups and user groups are involved, this step can be done efficiently. Finally, in step 5, reverse transformations are performed, and preference lists of individual products are offered to each user. The pseudo-code of the approach is given below. Details will be described later.

The Proposed Approach
/* Step 1: Developing item and user vectors */
Extract training patterns from the collected user comments;
Train the Word2Vec network and obtain item vectors and user vectors;
/* Step 2: Grouping for items and users */
Reduce the dimensionality of item and user vectors;
Cluster the item and user vectors into item and user groups;
Obtain transformation matrices of membership degrees;
/* Step 3: Getting the reduced user-item rating matrix */
Transform the original user-item rating matrix to the reduced matrix;
/* Step 4: Calculating group rating scores */
Construct the correlation matrix from the reduced user-item rating matrix;
Apply ItemRank to predict preference lists of item groups for user groups;
/* Step 5: Deriving individual preference lists */
Reversely transform to obtain preference lists of products for users;

Step 1-Developing Item and User Vectors
As mentioned, it is usually difficult to discover similarity between items or users solely based on the sparse ratings provided in the user-item rating matrix. Extra sources have to be consulted for help. In this step, we apply Word2Vec on user comments to discover semantic relationships between products. Each product is converted to an item vector. Then, the item vectors can be used for clustering products later. In addition, the users can be converted to user vectors, by which the clustering on users can be done.
We address the data sparsity problem by finding the relationship between items from the user comments posted on the public domain. We use these user comments to train a Word2Vec network. The word vectors obtained after training should reveal semantic relationships among items, so it is required that the inputs to the Word2Vec network be items in the training patterns.
We take the user review text of an item and insert the item identity into the text repeatedly. Suppose that the window size is 2s + 1. The item identity is inserted in the middle of every 2s words in the original text. Training patterns, each consisting of an input-output word pair (x, y), are extracted from the windows. For each window, the central word is taken as input x, and each of its context words is taken as output y to form training patterns. Assume that a window contains 2s + 1 words, w_{r+1} · · · w_{r+s} w_{r+s+1} w_{r+s+2} · · · w_{r+2s+1}, where w_{r+s+1} is the central word and the other words are context words. Then, there are 2s training patterns, (w_{r+s+1}, w_{r+1}), . . . , (w_{r+s+1}, w_{r+s}), (w_{r+s+1}, w_{r+s+2}), . . . , (w_{r+s+1}, w_{r+2s+1}), extracted from the window. Let the review text of an item, with its item identity being 0000013714, be:

I bought this for my husband who plays the piano He is having a wonderful time playing these old hymns The music is at times hard to read because we . . .

For s = 3, the user review text after insertion is shown below, ignoring punctuation marks:

I bought this 0000013714 for my husband who plays the 0000013714 piano He is having a wonderful 0000013714 time playing these old hymns The 0000013714 music is at times hard to 0000013714 read because we . . .
Note that the item identity 0000013714 appears in the middle of each non-overlapping window of seven words. For instance, the first window consists of the following words: I bought this 0000013714 for my husband.
Then, we extract training patterns from each window. Considering the first window above, six training patterns are extracted: (0000013714, I), (0000013714, bought), (0000013714, this), (0000013714, for), (0000013714, my), and (0000013714, husband). Note that the first word in each training pattern is the item identity.
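The insertion and pattern-extraction procedure above can be sketched as follows. The helper names are ours, and a final chunk shorter than 2s words is handled the same way as in the paper's example:

```python
def insert_item_id(words, item_id, s):
    """Insert item_id in the middle of every non-overlapping 2s-word chunk."""
    out = []
    for start in range(0, len(words), 2 * s):
        chunk = words[start:start + 2 * s]
        out.extend(chunk[:s] + [item_id] + chunk[s:])
    return out

def extract_pairs(words, s):
    """Extract (center, context) pairs from non-overlapping (2s+1)-word windows."""
    pairs = []
    for start in range(0, len(words) - 2 * s, 2 * s + 1):
        window = words[start:start + 2 * s + 1]
        center = window[s]                       # the inserted item identity
        pairs.extend((center, w) for k, w in enumerate(window) if k != s)
    return pairs

text = "I bought this for my husband who plays the piano".split()
augmented = insert_item_id(text, "0000013714", s=3)
pairs = extract_pairs(augmented, s=3)            # six pairs from the first window
```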
After extracting the training patterns from the collected user comments, the training patterns are fed into the Word2Vec network [21,22], as shown in Figure 1, for training.
The neural network is trained as follows. Suppose (w_i, w_k) is a training pattern involving two words, w_i and w_k. Word w_i corresponds to node i in the input layer, and word w_k corresponds to node k in the output layer, with input weights v1_i = [v1_{i1} v1_{i2} · · · v1_{iH}]^T and output weights v2_k = [v2_{k1} v2_{k2} · · · v2_{kH}]^T. Note that the input value at node i is 1, while those at the other nodes in the input layer are 0. In addition, the desired value at node k is 1, while those at the other nodes in the output layer are 0. The network output at node k is

o_k = exp(v2_k · v1_i) / Σ_m exp(v2_m · v1_i),

where '·' is the inner product operator and the sum runs over all nodes m in the output layer. Let J be J = −log o_k. By steepest descent, v1_i and v2_k are adjusted as

v1_i ← v1_i − η ∂J/∂v1_i, v2_k ← v2_k − η ∂J/∂v2_k,

where η is the learning rate. During the learning, if two training patterns (w_i, w_k) and (w_j, w_k) occur frequently in the training documents, v1_i and v1_j will be close to each other. When the learning is completed, the input weights of the network are taken to be the word vectors. That is, v1_1 is the word vector of word w_1, v1_2 is the word vector of word w_2, etc. Note that item vectors are the word vectors of the item identities.
For M products, we have M item vectors in total, p_1, p_2, · · · , p_M:

p_i = [p_{i1} p_{i2} · · · p_{iH}]^T

for i = 1, · · · , M. Clearly, each item vector contains H components. In this work, we set H = 100. Note that if two items are semantically similar, then their corresponding item vectors are close to each other. Now that item vectors are available, we can derive user vectors, which can then be used for similarity computation. First, we normalize each row in the user-item rating matrix R as follows:

r̄_ij = r_ij / Σ_{j=1}^{M} r_ij

for 1 ≤ i ≤ N. Then, user vectors u_k, 1 ≤ k ≤ N, are derived as follows:

u_k = Σ_{j=1}^{M} r̄_kj p_j.

Note that each user vector also contains H components.
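Assuming the row normalization and rating-weighted sum described above, user vectors can be computed in one pass (helper name ours):

```python
import numpy as np

def user_vectors(R, P):
    """R: N x M rating matrix; P: M x H item vectors. Returns N x H user vectors."""
    R = np.asarray(R, dtype=float)
    sums = R.sum(axis=1, keepdims=True)          # per-user rating totals
    # Normalize each row to sum to 1; users with no ratings get a zero vector.
    Rn = np.divide(R, sums, out=np.zeros_like(R), where=sums > 0)
    return Rn @ P                                 # weighted average of item vectors

P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3 items, H = 2 (toy)
R = np.array([[2, 2, 0], [0, 0, 5], [0, 0, 0]])      # 3 users x 3 items
U = user_vectors(R, P)
```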

Step 2-Grouping for Items and Users
This step aims to deal with the scalability problem. First of all, the dimensionality, H, can be too large for effective processing. PCA (Principal Component Analysis) [44,45] is applied to reduce the dimensionality of the item and user vectors. We choose q_p and q_u principal components so that they are as small as possible while the cumulative energy is above a certain threshold θ, e.g., 0.7. Therefore, there are q_p, instead of H, values in p_i, i = 1, · · · , M, and there are q_u, instead of H, values in u_i, i = 1, · · · , N.

Now, we apply an iterative self-constructing clustering algorithm [46], which is an iterative version of that used in [23], to partition the item and user vectors into item and user groups, respectively. We feed the item vectors p_i, 1 ≤ i ≤ M, as the training patterns, to the clustering algorithm. For each pattern p_i, i = 1, · · · , M, the membership degree of the pattern to each existing cluster C_j, with center c^p_j = [c^p_{1j} · · · c^p_{q_p j}]^T and deviation d^p_j = [d^p_{1j} · · · d^p_{q_p j}]^T, is calculated as

G_j(p_i) = Π_{k=1}^{q_p} exp(−((p_{ik} − c^p_{kj}) / d^p_{kj})^2).

Appl. Sci. 2020, 10, 2540

If G_j(p_i) < ρ for all existing clusters, a new cluster is created, and p_i is included in the new cluster. Otherwise, p_i is added into the cluster with the largest membership degree. Note that a Gaussian is adopted to describe the data distribution of each cluster, and G_j(p_i) lies in the range of (0, 1]. Suppose we have M_g clusters C^p_1, C^p_2, · · · , C^p_{M_g} at the end, each having its own center and deviation. Each cluster is regarded as one item group. Therefore, we have M_g item groups, denoted p^g_1, p^g_2, · · · , p^g_{M_g}, respectively. Let t^p_ij denote the membership degree of product p_i to item group p^g_j, computed as t^p_ij = G_j(p_i), for 1 ≤ i ≤ M and 1 ≤ j ≤ M_g. Then, we form the transformation matrix T^p = [t^p_ij], which is an M × M_g matrix, containing all the membership degrees of the products p_i, 1 ≤ i ≤ M, to the item groups p^g_j, 1 ≤ j ≤ M_g. Next, we feed the user vectors u_i, 1 ≤ i ≤ N, as the training patterns, to the clustering algorithm. Suppose we have N_g clusters C^u_1, C^u_2, · · · , C^u_{N_g} at the end.
Each cluster is regarded as one user group. Therefore, we have N_g user groups, denoted u^g_1, u^g_2, · · · , u^g_{N_g}, respectively. Let t^u_ij denote the membership degree of u_i to cluster C^u_j, computed as t^u_ij = G_j(u_i), for 1 ≤ i ≤ N and 1 ≤ j ≤ N_g. Then, we form the transformation matrix T^u = [t^u_ij], which is an N × N_g matrix, containing all the membership degrees of the users u_i, 1 ≤ i ≤ N, to the user groups u^g_j, 1 ≤ j ≤ N_g.
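A simplified sketch of the clustering loop, not the exact algorithm of [46]: we assume a fixed initial deviation d0 and update only cluster centers, whereas the full algorithm also updates deviations incrementally. All names are ours:

```python
import numpy as np

def membership(x, center, dev):
    """Gaussian membership degree G_j(x), lying in (0, 1]."""
    return float(np.exp(-np.sum(((x - center) / dev) ** 2)))

def self_constructing_cluster(patterns, rho=0.7, d0=0.5):
    centers, devs, counts = [], [], []
    for x in patterns:
        degrees = [membership(x, c, d) for c, d in zip(centers, devs)]
        if not degrees or max(degrees) < rho:
            centers.append(x.astype(float).copy())    # create a new cluster at x
            devs.append(np.full(x.shape, d0))
            counts.append(1)
        else:
            j = int(np.argmax(degrees))               # join the best-matching cluster
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]  # incremental mean update
            # (the full algorithm would also update devs[j] here)
    return centers

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups = self_constructing_cluster(pts)
```

Points near the origin fall into one cluster and points near (5, 5) into another, illustrating how the number of groups emerges from the data rather than being fixed in advance.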

Step 3-Getting a Reduced User-Item Rating Matrix
In this step, we transform the original user-item rating matrix R to the reduced matrix R_g. We first convert R to a matrix B by

B = R T^p,

which is an N × M_g matrix, indicating the ratings for the M_g item groups by the N users. Next, we convert B to R_g by

R_g = (T^u)^T B,

which is an N_g × M_g matrix. Note that R_g contains the ratings for the M_g item groups by the N_g user groups. We call R_g the reduced user-item rating matrix.
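Under the assumption that the reduction uses the membership-degree matrices T^p and T^u directly, Step 3 amounts to two matrix products (helper name ours):

```python
import numpy as np

def reduce_rating_matrix(R, Tp, Tu):
    """R: N x M ratings; Tp: M x Mg memberships; Tu: N x Ng memberships."""
    B = R @ Tp              # N x Mg: item-group ratings per user
    Rg = Tu.T @ B           # Ng x Mg: item-group ratings per user group
    return B, Rg

rng = np.random.default_rng(1)
R = rng.integers(0, 6, size=(7, 7)).astype(float)   # toy 7 x 7 rating matrix
Tp = rng.random((7, 2))                             # 7 items, 2 item groups
Tu = rng.random((7, 3))                             # 7 users, 3 user groups
B, Rg = reduce_rating_matrix(R, Tp, Tu)
```

The payoff is the shape change: subsequent processing works on the 3 × 2 matrix Rg instead of the 7 × 7 matrix R.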

Step 4-Calculating Group Rating Scores
In this step, one preference list of item groups is derived for each user group. Since low numbers of item groups and user groups are involved, this step can be done efficiently. Firstly, a correlation graph is built from the reduced user-item rating matrix R g . Then, a series of random walks based on ItemRank [9] are performed.
Based on R_g, we construct a correlation graph that shows the inter-relationship among the item groups. Each item group is regarded as a node in the graph, and thus we have M_g nodes in total. The weight m^g_ij on the edge between node p^g_i and node p^g_j, 1 ≤ i, j ≤ M_g, is properly defined [23]. When the correlation graph is completed, we have the correlation matrix M_g = [m^g_ij], which is an M_g × M_g matrix. Then, we normalize each column by

m^g_ij ← m^g_ij / Σ_{k=1}^{M_g} m^g_kj

for 1 ≤ j ≤ M_g. After creating the correlation matrix M_g, we proceed with ItemRank to predict a preference list of item groups for each user group. Consider any user group u^g_i with its group rating vector r^g_i, which is a vector with M_g elements. The following operation is performed iteratively for t = 0, 1, 2, · · · until convergence:

h_i(t + 1) = α M_g h_i(t) + (1 − α) r^g_i.

Note that r^g_i is the transpose of the ith row of R_g and α ∈ [0, 1] is a user-defined constant. A common choice for α is 0.85. Usually, convergence is reached after 20 iterations [9]. Let h_i be the converged result, having M_g components. Then, h_i is the predicted preference list of item groups for the user group u^g_i, 1 ≤ i ≤ N_g.
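The iteration can be sketched as follows, assuming the correlation matrix has already been column-normalized; the function name and the toy values are ours:

```python
import numpy as np

def item_rank(Mg, rg, alpha=0.85, iters=20):
    """Damped random-walk iteration h <- alpha*Mg*h + (1-alpha)*rg."""
    h = np.full(Mg.shape[0], 1.0 / Mg.shape[0])    # uniform start
    for _ in range(iters):
        h = alpha * (Mg @ h) + (1 - alpha) * rg
    return h

Mg = np.array([[0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5],
               [0.5, 0.5, 0.0]])                   # columns sum to 1
rg = np.array([0.6, 0.3, 0.1])                     # one user group's normalized ratings
h = item_rank(Mg, rg)
```

Because the columns of Mg sum to 1 and rg sums to 1, the iterate keeps a total mass of 1, and the ordering of h reflects the group's ratings propagated through the correlation structure.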

Step 5-Deriving Individual Preference Lists
Now, we have obtained h_1, h_2, · · · , h_{N_g} for user groups u^g_1, u^g_2, · · · , u^g_{N_g}, respectively. Each h_i, 1 ≤ i ≤ N_g, is a preference list in which M_g item groups are involved. However, we are interested in recommending individual products to each user. In this step, we do reverse transformation to get a predicted preference list of products for each user.
Firstly, we find a predicted preference list of products for each user group. We compute vector y_i by

y_i = T^p h_i

for 1 ≤ i ≤ N_g. Clearly, y_i is of length M, and it is the predicted preference list of products for user group u^g_i. Next, we compute vector s_i as s_i = t^u_{i1} y_1 + t^u_{i2} y_2 + · · · + t^u_{iN_g} y_{N_g} (22) for 1 ≤ i ≤ N, which is the predicted preference list of products for user u_i. As in ItemRank, the products can be recommended to user u_i in decreasing order of the magnitudes of the elements in s_i.
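In matrix form, assuming the rows of H are the group preference vectors h_i, the two reverse transformations collapse to two products (helper name ours):

```python
import numpy as np

def individual_preferences(H, Tp, Tu):
    """H: Ng x Mg group preferences; Tp: M x Mg; Tu: N x Ng memberships."""
    Y = H @ Tp.T            # Ng x M: y_i = Tp h_i, product list per user group
    S = Tu @ Y              # N x M:  s_i = sum_j t^u_ij * y_j, per user
    return S

H = np.array([[1.0, 0.0]])                 # one user group, Mg = 2 (toy)
Tp = np.array([[0.9, 0.1], [0.2, 0.8]])    # M = 2 products
Tu = np.array([[1.0]])                     # one user fully in the group
S = individual_preferences(H, Tp, Tu)
```

Sorting each row of S in decreasing order yields the ranked product list for the corresponding user.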

Experimental Results
To evaluate the performance of our proposed approach, we conduct a set of experiments on several benchmark datasets. For convenience, we call our approach CFUCC (Collaborative Filtering based on User Comments and Clustering) in the remainder of this section. We also compare our CFUCC approach with some other collaborative filtering recommender systems.
Two metrics are adopted for comparison of recommendation accuracy: mean absolute error (MAE) and root mean squared error (RMSE) [47]. Let P be the set of all products, and let L_i and T_i be the two sets containing the products user u_i has rated in the training set and in the testing set, respectively. Note that it is required that no L_i be empty, i.e., L_i ≠ ∅, 1 ≤ i ≤ N. To compute MAE or RMSE, we have to convert predicted preferences to corresponding predicted scores. Let r̂_ij be the predicted score corresponding to the predicted preference of product p_j for user u_i. We compute r̂_ij by

r̂_ij = r^a_i + Σ_{k≠i} Sim(u_i, u_k)(r_kj − r^a_k) / Σ_{k≠i} |Sim(u_i, u_k)|,

where r^a_i is the average of the ratings in L_i, r^a_k is the average of the ratings in L_k, r_kj is the rating for product j in L_k, and Sim(u_i, u_k) indicates the similarity between user u_i and user u_k, defined by the cosine of the predicted preference lists s_i and s_k:

Sim(u_i, u_k) = (s_i · s_k) / (‖s_i‖ ‖s_k‖),

with '·' being the inner product operator for vectors. Then, MAE and RMSE are defined as

MAE = Σ_{i=1}^{N} Σ_{p_j ∈ T_i} |r̂_ij − r_ij| / Σ_{i=1}^{N} |T_i|, RMSE = sqrt(Σ_{i=1}^{N} Σ_{p_j ∈ T_i} (r̂_ij − r_ij)^2 / Σ_{i=1}^{N} |T_i|).

Note that a low MAE or RMSE value indicates the superiority of a recommender system. A 5-fold cross-validation is adopted for our experiments. In each experiment, the entries in a dataset are split randomly into five different subsets. Then, five runs are performed. Each time, four of the five subsets are used for training, and the remaining one is used for testing. Then, the results of the five runs are averaged. Note that in each of the following experiments, the predicted preference lists are derived from the training set, and all MAE and RMSE values are measured on the testing set. In addition, no overlapping exists between the training set and the testing set in any experiment.
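Given the predicted and true ratings over the test pairs, the two metrics themselves are straightforward to compute (helper name ours):

```python
import numpy as np

def mae_rmse(true_ratings, predicted):
    """MAE and RMSE over corresponding lists of test ratings and predictions."""
    t = np.asarray(true_ratings, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - t
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return mae, rmse

mae, rmse = mae_rmse([1, 2, 4], [2, 2, 2])
```

Note that RMSE penalizes large individual errors more heavily than MAE, which is why both are reported.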
In the following experiments, we work with three datasets: Amazon Digital Music, Amazon Video Game, and Amazon Apps for Android. They were collected by Julian McAuley from Amazon.com over a period of 18 years (1996−2014) and were described in [48,49]. Each piece of data contains the information shown in Figure 2. Several fields of information were recorded. However, we are interested in the fields 'reviewerID' (user ID), 'asin' (item ID), 'reviewText' (user text comments on the item), and 'overall' (user's rating of the item). Among them, 'asin' and 'reviewText' are used by Word2Vec to form item vectors, while 'reviewerID' and 'overall' are used in forming the user-item rating matrix. Each non-zero entry in the matrix is represented as a triple (u_i, p_j, r_ij), where r_ij ∈ {1, 2, 3, 4, 5}. The Amazon Digital Music dataset contains 64,706 reviews, with 5541 users and 3568 items. The sparsity of this dataset is 1 − 64706/(5541 × 3568) ≈ 99.7%. The Amazon Video Game dataset contains 231,780 reviews, with 24,303 users and 10,672 items. The sparsity of this dataset is 99.9%. The original Amazon Apps for Android dataset contains 752,937 reviews, with 97,281 users and 13,209 items and a sparsity of 99.93%. To run the other systems easily, we randomly take 385,482 reviews, with 35,001 users, giving a sparsity of 99.92%. The characteristics of these datasets are summarized in Table 3. As can be seen, sparsity and scalability are very serious in all of these datasets.

We show the effectiveness, regarding both accuracy and efficiency, of our approach, CFUCC, by comparing it with some other collaborative filtering methods, including SCC (Self-Constructing Clustering) [23], CFCB (Collaborative Filtering by Clustering Both users and items) [2], BiFu (Biclustering and Fusion) [26], and ICRRS (Iterative method for Calculating Robust Rating Scores) [35]. SCC applies clustering on products, but users are not clustered. BiFu applies K-means to cluster the users and products, and ICRRS uses a reduction technique that decouples the credibility assessment of the cast evaluations from the ranking itself. CFCB clusters users based on users' ratings on products. For a fair comparison, we wrote programs for these methods. All the programs were written in Python 3.6, running on a computer with an Intel(R) Core(TM) i7-4790K CPU, 4.00 GHz, 32 GB of RAM, and 64-bit Windows 10.
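The sparsity figures can be reproduced directly from the review, user, and item counts (helper name ours):

```python
def sparsity(n_ratings, n_users, n_items):
    """Fraction of empty cells in the user-item rating matrix."""
    return 1.0 - n_ratings / (n_users * n_items)

s_music = sparsity(64706, 5541, 3568)      # Amazon Digital Music -> ~99.7%
s_games = sparsity(231780, 24303, 10672)   # Amazon Video Game -> ~99.9%
```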

Experiment 1-Amazon Digital Music Dataset
We work with the Amazon Digital Music dataset in this experiment. Table 4 shows comparisons of MAE, RMSE, and Time(s) among CFUCC, SCC, ICRRS, CFCB, and BiFu. In this table, the value obtained by the best method for each case is shown in boldface. For CFUCC, we use ρ = 0.7 for item clustering, ρ = 0.6 for user clustering, v_0 = 0.5, and θ = 0.7 for PCA. CFUCC clusters products into 35 groups and users into 32 groups, SCC clusters products into 204 groups, CFCB clusters products into 35 groups, and BiFu divides items and users into 35 groups each. As can be seen from Table 4, CFUCC is the best among the methods: it has the lowest MAE (0.709) and the lowest RMSE (0.996), and it runs most efficiently (51.9 s of CPU time).
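The MAE and RMSE values reported in these tables follow the standard definitions over the (true rating, predicted rating) pairs of the test set; a minimal sketch:

```python
import numpy as np

# Mean Absolute Error: average magnitude of prediction errors.
def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Root Mean Squared Error: penalizes large errors more heavily than MAE.
def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

true = [5, 3, 4, 1]          # illustrative test ratings
pred = [4.5, 3.5, 4.0, 2.0]  # illustrative predicted ratings
print(mae(true, pred), rmse(true, pred))
```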

Experiment 2-Amazon Video Game Dataset
We work with the Amazon Video Game dataset in this experiment. Table 5 shows comparisons among the different methods. CFUCC clusters products into 32 groups and users into 15 groups, SCC clusters products into 19 groups, CFCB clusters products into 35 groups, and BiFu divides items into 35 groups and users into 20 groups. As can be seen from this table, CFUCC performs the best: it has the lowest MAE (0.791) and the lowest RMSE (1.224), and it runs most efficiently (111.7 s of CPU time).

Experiment 3-Amazon Apps for Android Dataset
We work with the reduced version of the Amazon Apps for Android dataset in this experiment. Table 6 shows comparisons among the different methods. CFUCC clusters products into 25 groups and users into 38 groups, SCC clusters products into 153 groups, CFCB clusters products into 35 groups, and BiFu divides items into 25 groups and users into 35 groups. In this experiment, CFUCC performs the best in every case: it has the lowest MAE (1.024), the lowest RMSE (1.331), and it runs most efficiently (194.2 s of CPU time).

Experiment 4-Impact of Imputation
In this experiment, we show how imputation has an impact on the recommended results. Several strategies have been proposed to deal with missing values in the user-item rating matrix. Among them, four popular ones are:

• No imputation. If user u_i has not provided a rating for product p_j, r_ij = 0. That is, a missing value is treated as 0.
• Imputation by least. Each missing value is replaced with the least positive score. For scores of 1-5, a missing value is replaced with 1.
• Imputation by mid. Each missing value is replaced with the middle of the positive scores. For scores of 1-5, a missing value is replaced with 3.
• Imputation by mean. Each missing value is replaced with the mean value of all the positive ratings in the given user-item rating matrix.

Table 7 shows the performance comparison of these four strategies for the Amazon Digital Music dataset. As can be seen, CFUCC is hardly affected by imputation. The reason is that CFUCC uses user comments, instead of the rating scores in the user-item rating matrix, to generate and cluster item vectors. Therefore, it is more resistant to the noise introduced by imputation strategies. For the other methods, the recommended results depend on the rating scores in the matrix, and thus they behave differently with different imputation strategies.
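The four strategies can be sketched on a toy rating matrix, where 0 marks a missing rating and valid scores are 1-5 (the matrix values here are illustrative only):

```python
import numpy as np

# Toy user-item rating matrix; 0 marks a missing rating.
R = np.array([[5, 0, 3],
              [0, 4, 0],
              [1, 0, 2]], dtype=float)

missing = (R == 0)

none_imp  = R.copy()                   # no imputation: missing stays 0
least_imp = np.where(missing, 1.0, R)  # imputation by least: fill with 1
mid_imp   = np.where(missing, 3.0, R)  # imputation by mid: fill with 3
mean_imp  = np.where(missing, R[~missing].mean(), R)  # mean of positive ratings

print(mean_imp)
```

Observed ratings are left untouched by every strategy; only the zero entries are filled.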

Experiment 5-Different Ways of Producing User Vectors
We compare three methods of producing user vectors in CFUCC:

• Method 1. This is the method CFUCC adopts, as described previously.
• Method 2. The user identity shown in the "reviewerID" field is inserted repetitively into the review text in the "reviewText" field, in the same way we generate item vectors. By training a Word2Vec network, the user vectors are produced.
• Method 3. All the review texts of a user are collected in one document. Then, Doc2Vec [50] is applied to obtain a document vector for the document, and the obtained document vector is regarded as the user vector for this user.

Table 8 shows the performance comparison of these methods for the Amazon Digital Music dataset. As can be seen, Method 1, which we adopt in CFUCC, gets the best results. In general, a review given by a user provides comments on items rather than on users. Therefore, the user vectors obtained by Method 2 are less effective. In addition, the collected document for a user is usually too short to represent him/her distinctively. Worst of all, a user may not have given any comments at all. Consequently, Method 3 performs less effectively.
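Method 1 derives user vectors from the item vectors through the user-item rating matrix. As a hedged sketch only, the snippet below assumes the user vector is the rating-weighted average of the item vectors of the products the user rated; this particular weighting is an illustrative assumption, not necessarily the paper's exact formula:

```python
import numpy as np

# Illustrative item vectors (4 items, 100-dim, as if produced by Word2Vec).
rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(4, 100))

# One user's row of the rating matrix; 0 means the item was not rated.
ratings = np.array([5.0, 0.0, 3.0, 0.0])

# Assumed combination rule: rating-weighted average over rated items.
rated = ratings > 0
user_vec = (ratings[rated] @ item_vecs[rated]) / ratings[rated].sum()
print(user_vec.shape)  # (100,)
```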

Experiment 6-Impact of User Comments
We show the effectiveness of using user comments for clustering products and users in CFUCC by comparing two methods:

• Method 1. This is the method that CFUCC adopts. That is, it applies Word2Vec to user comments to get item vectors, from which user vectors are obtained, and then clusters products and users based on the item and user vectors.
• Method 2. User comments are not used. Products and users are clustered directly based on the rating scores provided in the given user-item rating matrix.

Table 9 shows the performance comparison of these methods for the Amazon Digital Music dataset. Note that both methods use identical values for the parameters involved in the recommendation process. As can be seen, Method 1, which we adopt in CFUCC, performs better in terms of both accuracy and efficiency. Method 2 uses the user-item rating matrix for clustering; this suffers from the sparsity problem and makes the recommended results less accurate. Clearly, both the MAE and RMSE of Method 2 are worse than those of Method 1. Secondly, Method 2 clusters users directly. Since there are 3568 products in the Amazon Digital Music dataset, each pattern is a 3568-dimensional vector in the clustering process. In Method 1, by Word2Vec and PCA, each pattern is a vector of fewer than 100 dimensions. Therefore, users can be clustered faster by Method 1, and recommendation can be done more efficiently.
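The dimensionality argument above can be illustrated directly: the dominant cost of similarity-based clustering is computing pairwise similarities, which scales linearly with the pattern dimensionality. The snippet compares 3568-dimensional rating-matrix rows (as in Method 2) with 100-dimensional embedded vectors (as in Method 1); the user count is illustrative, not the paper's exact setup:

```python
import time
import numpy as np

n_users = 1000
rng = np.random.default_rng(0)

times = {}
for d in (3568, 100):
    X = rng.normal(size=(n_users, d))   # one pattern per user, d dimensions
    t0 = time.perf_counter()
    sims = X @ X.T                      # all pairwise inner products
    times[d] = time.perf_counter() - t0

print({d: f"{t:.4f}s" for d, t in times.items()})
```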

Experiment 7-Impact of Parameter Settings
The energy threshold θ determines the degree of reduction in the dimensionality of the vectors produced by PCA. Table 10 compares the performance of four different values of θ for the Amazon Digital Music dataset. The columns "URdim" and "IRdim" indicate the reduced dimensionality of the user vectors and item vectors, respectively, after PCA. Note that the original vectors have 100 dimensions. When θ = 0.7, the reduced user vectors have nine dimensions, while the reduced item vectors have 28 dimensions. As expected, a larger θ results in a higher dimensionality of the reduced vectors. Note that CFUCC with θ = 0.7 performs the best. However, the performance is quite stable for values near 0.7.
Next, we show the impact of different values of ρ on the recommendation results obtained by CFUCC. Note that a different ρ value can lead to a different number of clusters being produced: a higher ρ produces more clusters. Table 11 compares the performance of different values of ρ for the Amazon Digital Music dataset. In the left part of the table, the ρ for clustering user vectors is kept at 0.6, while the ρ value for clustering item vectors varies. In the right part of the table, the ρ for clustering item vectors is kept at 0.7, while the ρ value for clustering user vectors varies. As can be seen, the value of ρ does affect the performance of CFUCC, but not significantly.
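The role of θ can be sketched as follows: keep the smallest number of principal components whose cumulative explained variance (energy) reaches θ. This is a minimal PCA-via-SVD illustration on synthetic data, not the paper's implementation:

```python
import numpy as np

# Smallest number of principal components whose cumulative
# explained variance reaches the energy threshold theta.
def reduced_dim(X: np.ndarray, theta: float) -> int:
    Xc = X - X.mean(axis=0)                       # center the data
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    energy = (s ** 2) / (s ** 2).sum()            # variance fraction per component
    cum = np.cumsum(energy)
    return int(np.searchsorted(cum, theta) + 1)   # first index reaching theta

# 100-dim vectors whose variance decays smoothly across directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100)) * np.linspace(10, 0.1, 100)

print(reduced_dim(X, 0.7))  # fewer dimensions kept ...
print(reduced_dim(X, 0.9))  # ... than with a larger theta
```

As in Table 10, raising θ keeps more components and hence yields higher-dimensional reduced vectors.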

Conclusions
We have presented an extension of [23] to address the inaccuracy and inefficiency caused by the data sparsity and scalability in collaborative filtering recommendation. Word2Vec is applied on user comments to assign an item vector to each product. Through the user-item rating matrix, the user vectors of all the users are also produced. Then, the products and users are clustered into item groups and user groups, respectively. Based on these item groups and user groups, recommendations to a user can be made. Experimental results have shown that both the accuracy and efficiency of recommendation are improved by the proposed approach.
Several directions will be investigated in the future. Doc2Vec or other embedding techniques may be used for developing item vectors. Item descriptions, such as those available in the Amazon datasets, can be used for creating item or user vectors. We may also use the reviews to learn word embeddings and represent each item as the aggregation of the embedded vectors of the words appearing in its reviews. So far, we have only adopted MAE and RMSE to evaluate the quality of the predictions; other measures, such as precision, recall, and nDCG, may be used to evaluate the quality of the recommendations. Recently, neural collaborative filtering has been proposed for recommendation [51,52]. It will be interesting to investigate the effectiveness of incorporating it into our system.