Collaborative Filtering Recommendation Algorithm Based on TF-IDF and User Characteristics

The recommendation algorithm is an important and challenging component of a personalized recommender system. The collaborative filtering recommendation algorithm is one of the most popular and effective recommendation algorithms. However, the traditional collaborative filtering recommendation algorithm does not fully consider the impact of popular items and user characteristics on the recommendation results. To solve these problems, an improved collaborative filtering algorithm is proposed, which is based on the Term Frequency-Inverse Document Frequency (TF-IDF) method and user characteristics. In the proposed algorithm, an improved TF-IDF method is first used to calculate the user similarity on the basis of rating data. Secondly, the multi-dimensional characteristics information of users is used to calculate the user similarity by a fuzzy membership method. Then, the above two user similarities are fused based on an adaptive weighted algorithm. Finally, experiments are conducted on a public movie data set, and the experimental results show that the proposed method performs better than the state-of-the-art methods it is compared with.


Introduction
With the advent of the big data era, information on the Internet has grown exponentially. People have entered the era of information explosion from the past when information was scarce. However, most of this massive amount of information is worthless. The information explosion has made it more and more difficult for people to obtain valuable information from the Internet [1]. To improve the efficiency of production and life, people need some information filtering technologies to filter out useless information. The recommender systems are software tools and techniques providing suggestions for items which are useful to a user. As one of the effective information filtering tools, the personalized recommendation system can help users efficiently obtain information that meets their needs when their needs are unclear [2].
The core of a personalized recommendation system is the recommendation algorithm, which mainly includes the content-based recommendation algorithm, the collaborative filtering recommendation algorithm, and the hybrid recommendation algorithm [3,4]. Among them, because of its high efficiency, accuracy, and personalization, the collaborative filtering recommendation algorithm has become one of the most effective and widely applied recommendation algorithms [5]. For example, Nakagawa and Ito [6] proposed a recommendation system which can recommend interesting document files to users by collaborative filtering. Yu et al. [7] presented the application of a collaborative filtering algorithm in the field of E-commerce. Park et al. [8] presented a fast collaborative filtering algorithm with a k-nearest neighbor graph. Wu et al. [9] used a collaborative filtering algorithm to improve the prediction accuracy of large-scale recommendation systems. Bartolini et al. [10] implemented a personalized recommendation. Although the collaborative filtering algorithm has been widely used, there are still some problems, such as data sparsity, cold start, and information expiration [11].
To solve the problems above, a series of improvements based on the traditional collaborative filtering algorithm were made and achieved some success. For example, Piraste et al. [12] alleviated the sparsity and cold start problems of the matrix using the film type label and director genre. Kumar et al. [13] used matrix decomposition technology to reduce the dimension of the matrix and improve the accuracy of the recommendation result. Sun and Dong [14] proposed a dynamic time drift model considering the influence of user interest changes on similarity in different time periods. Wang et al. [15] proposed a collaborative filtering algorithm combining the KNN model and XGBoost model. Zarzour et al. [16] presented a new effective model-based trust collaborative filtering to improve the quality of recommendation. In addition, there are some collaborative filtering algorithms based on clustering [17], neural networks [18], and various probability models [19]. The above studies optimized the recommendation model to a certain extent and improved the accuracy of the recommendation results, but there are still some problems to be further studied. For example, most of the existing collaborative filtering algorithms only consider the rating information among users, but ignore the user characteristics and the impact of popular items on user similarity, which leads to poor recommendation results.
To further improve the accuracy of recommendation, a collaborative filtering algorithm based on the TF-IDF method and user characteristics is proposed in this paper. In the proposed method, both the rating information and the characteristics of the users are fully considered. The contributions of this paper can be summarized as follows: (1) Based on the rating data, the TF-IDF method is used to calculate the user similarity matrix, which penalizes the impact of popular items on user similarity and improves the ability to mine unpopular items. (2) The user characteristics are fully considered in the proposed method: they are used to calculate user similarity based on a fuzzy membership function, and the cold start problem is addressed by combining the characteristics information of users in different dimensions. (3) An adaptive weighted algorithm is presented to fuse the two kinds of user similarity obtained in the above two steps into a new comprehensive user similarity for the recommendation algorithm. Finally, experiments are carried out on real data sets to evaluate the accuracy of the proposed recommendation model. Experimental results show that the proposed algorithm is more accurate than the state-of-the-art algorithms it is compared with.
This paper is organized as follows. Section 2 gives an overview of related work. The proposed algorithm is presented in Section 3. Section 4 provides the experiments and the analysis of the results. The parameters and performance of the proposed algorithm are discussed in Section 5. Section 6 gives the conclusions.

Related Work
The basic idea of the collaborative filtering algorithm can be simply summarized as recommending items of interest to target users who have similar interests [20,21]. As shown in Figure 1, the collaborative filtering algorithm is mainly divided into three steps, namely establishing the user-item rating matrix, finding other users with similar interests to the target users, and finally making recommendations by rating and predicting based on similar users. Traditional collaborative filtering (CF) algorithms are mainly divided into user-based collaborative filtering (UCF) and item-based collaborative filtering (ICF) (see Figure 2). There are many improvements in the collaborative filtering recommendation algorithms, to solve the data sparsity and cold start problem. These existing methods give a good research basis for the recommendation system.
This paper focuses on the user-based collaborative filtering algorithm, which is better suited to reflecting the favorite items of groups with similar interests and produces more social recommendation results. The proposed collaborative recommendation algorithm is similar to existing TF-IDF-based methods. However, there are several differences between the proposed method and those existing methods. In the proposed method, the TF-IDF method is applied to rating data, and the user characteristics are fused to optimize the user similarity and improve the accuracy of rating prediction. It is different from methods that use a time-dependent similarity measure to compute the user similarity without considering user characteristics [22]. It is also different from methods that directly calculate the user similarity through the TF-IDF method [23].

[Figure: flow of the collaborative filtering algorithm: data set preprocessing, user-item rating matrix, similarity matrix, nearest neighbor, generate recommendations.]

The user-based collaborative filtering algorithm first needs to calculate the similarity between the target user and other users. Then, some users with high similarity are selected as the nearest neighbor set. Finally, the ratings of the target user are predicted for the items rated by the neighbor set. The main process of the traditional user-based collaborative filtering algorithm is described as follows.

Data Preprocessing
Suppose the data set of a recommender system is D{U, I, R}, where $U = \{u_1, u_2, \ldots, u_m\}$ is the user set of the system, $I = \{i_1, i_2, \ldots, i_n\}$ is the item set of the system, and R is a user-item rating matrix. For a data set with m users and n items, the data are preprocessed to obtain an m × n user-item rating matrix R(m × n), which is shown as follows:

$$R(m \times n) = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix} \tag{1}$$

where $r_{ij}$ represents the rating of user $u_i$ for item $i_j$.
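As an illustration of the preprocessing step, the following minimal sketch builds a small user-item rating matrix from (user, item, rating) triples. The function name `build_rating_matrix`, the zero-based indices, and the use of 0 to mark a missing rating are assumptions made for this example and are not taken from the paper.

```python
def build_rating_matrix(ratings, num_users, num_items):
    """Build an m x n user-item rating matrix as a list of lists.

    ratings: iterable of (user_index, item_index, rating) triples.
    Entries without a rating stay at 0.0 (an assumption of this sketch).
    """
    matrix = [[0.0] * num_items for _ in range(num_users)]
    for user, item, rating in ratings:
        matrix[user][item] = float(rating)
    return matrix


# Toy data: 3 users, 3 items, 5 observed ratings.
toy_ratings = [(0, 0, 5), (0, 2, 3), (1, 0, 4), (1, 1, 2), (2, 2, 1)]
R = build_rating_matrix(toy_ratings, num_users=3, num_items=3)
```

For real data sets a sparse representation is usually preferable, since most entries of the matrix are unobserved.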

Similarity Calculation
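In the recommender system, there are three main methods used to calculate the similarity between two users: the cosine similarity, the adjusted cosine similarity, and the Pearson similarity [24]. In this study, the Pearson similarity is used, which is calculated over the items rated by both users. The Pearson similarity is shown as follows:

$$Sim(u, v) = \frac{\sum_{i \in I_{u,v}} (R_{u,i} - \bar{R}_u)(R_{v,i} - \bar{R}_v)}{\sqrt{\sum_{i \in I_{u,v}} (R_{u,i} - \bar{R}_u)^2}\,\sqrt{\sum_{i \in I_{u,v}} (R_{v,i} - \bar{R}_v)^2}} \tag{2}$$

where $I_{u,v}$ denotes the set of items rated by both user u and user v; $R_{u,i}$ and $R_{v,i}$ represent the ratings of user u and user v on the i-th item, respectively; and $\bar{R}_u$ and $\bar{R}_v$ represent the average of all the ratings of user u and user v, respectively.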
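The Pearson similarity over co-rated items can be sketched as follows. This is a minimal example under stated assumptions: each user's ratings are stored in a dictionary mapping item to rating, the user means are taken over all of that user's ratings, and the similarity is set to 0 when the users share no items or when a denominator vanishes. The function name `pearson_similarity` is hypothetical.

```python
import math


def pearson_similarity(ratings_u, ratings_v):
    """Pearson similarity of two users over the items both have rated.

    ratings_u, ratings_v: dict mapping item id -> rating.
    Returns 0.0 if there is no overlap or a denominator is zero
    (a convention chosen for this sketch).
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)


user_a = {1: 5, 2: 3, 3: 1}
user_b = {1: 4, 2: 3, 3: 2}   # same taste as user_a, narrower rating range
user_c = {1: 1, 2: 3, 3: 5}   # opposite taste
```

In this toy example, `user_a` and `user_b` rank the items identically, so their similarity is close to 1, while `user_a` and `user_c` are close to -1.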

Generate Recommendation Set
Before predicting ratings to generate recommendations, it is necessary to determine the target user's similar neighbor set. A similar neighbor set refers to the set of users who have preferences similar to those of the target user. In the recommendation system, the K most similar users are usually selected as the nearest neighbors to form the similar neighbor set of the target user [25].
After the neighbor set of the target user is selected, the neighbors' ratings of the items are combined with the similarity between the users to predict the target user's ratings on the test set. The rating prediction is calculated as follows [26]:

$$P_{u,i} = \bar{R}_u + \frac{\sum_{v \in S(u,K) \cap N(i)} Sim(u, v)\,(R_{v,i} - \bar{R}_v)}{\sum_{v \in S(u,K) \cap N(i)} |Sim(u, v)|} \tag{3}$$

where $P_{u,i}$ represents the predicted rating of user u for unknown item i; S(u, K) is the set of K users most similar to user u; and N(i) represents the set of users who have rated item i. After rating prediction, the N items with the highest predicted ratings are selected from the predicted rating set as the recommendation results for the target user, and the recommendation process ends [27].
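The neighbor selection and rating prediction described above can be sketched as follows. The sketch assumes the common mean-centered weighted-average form of user-based prediction (the target user's mean rating plus the similarity-weighted deviations of the neighbors in S(u, K) who belong to N(i)); the data layout and the function name `predict_rating` are hypothetical, and the fallback to the target user's mean when no neighbor has rated the item is a choice made for this example.

```python
def predict_rating(target, item, ratings, similarity, k):
    """Predict the target user's rating of an item from the K nearest neighbors.

    ratings: dict user -> {item: rating}
    similarity: dict (target, other_user) -> similarity value
    """
    mean_target = sum(ratings[target].values()) / len(ratings[target])
    # S(u, K): the K users most similar to the target user.
    others = [v for v in ratings if v != target]
    neighbors = sorted(others, key=lambda v: similarity[(target, v)], reverse=True)[:k]
    # Keep only the neighbors that also belong to N(i), i.e. have rated the item.
    raters = [v for v in neighbors if item in ratings[v]]
    denom = sum(abs(similarity[(target, v)]) for v in raters)
    if denom == 0:
        return mean_target  # fallback chosen for this sketch
    num = 0.0
    for v in raters:
        mean_v = sum(ratings[v].values()) / len(ratings[v])
        num += similarity[(target, v)] * (ratings[v][item] - mean_v)
    return mean_target + num / denom


toy = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 5, "b": 3, "c": 4},
    "w": {"a": 1, "c": 2},
}
toy_sim = {("u", "v"): 0.9, ("u", "w"): 0.1}
p_two_neighbors = predict_rating("u", "c", toy, toy_sim, k=2)
p_one_neighbor = predict_rating("u", "c", toy, toy_sim, k=1)
```

Sorting the predicted ratings of all unrated items and keeping the top N then yields the recommendation list.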

Proposed Method
As introduced in Section 2, the traditional user-based collaborative filtering algorithms usually only use the user's rating information, but ignore the impact of other aspects of user information and popular items on user similarity. To deal with these problems, an improved collaborative recommendation algorithm (defined as ICFTU) is proposed by combining the Term Frequency-Inverse Document Frequency (TF-IDF) method and user characteristics model. The overall framework of the proposed algorithm is shown in Figure 3, which has three main parts, namely the improved TF-IDF-based method, the improved user characteristics model, and the proposed fusion strategy. The proposed method will be presented as follows.

Improved TF-IDF Based Method
The traditional collaborative filtering algorithm calculates the user similarity matrix based on the users' rating data of items, which is easily affected by popular items. For example, "Shawshank Redemption" is a very good movie. If user A and user B both gave the movie "Shawshank Redemption" 5 points, the traditional collaborative filtering algorithm will conclude that user A and user B have high similarity. However, this is not necessarily the case. The same behavior of users on popular items does not mean that they have similar interests. On the contrary, if two users have shown the same behavior on unpopular items, it is more likely that their interests are similar. For example, if both users A and B have watched movies from a relatively unpopular category, such as musicals, then they can be considered to have similar interests. Therefore, in order to reduce the impact of popular items on the user similarity, the Term Frequency-Inverse Document Frequency (TF-IDF) method is applied to the traditional collaborative filtering algorithm in this paper, where it is used to penalize the popular items in the user behavior list. The main reason for using the TF-IDF method is that it is suitable for the problem of weight extraction. In addition, the TF-IDF method is simple and easy to calculate [28].
TF-IDF is a statistical method, which is often used to evaluate the importance of a word to a file. The importance of a word is directly proportional to the number of times it appears in the file, but at the same time, it is inversely proportional to the frequency with which it appears in the file library [28]. Based on the principle of TF-IDF, an improved user similarity calculation method is proposed to reduce the weight of popular items in the user similarity. If an item appears in a user's behavior list, but also appears many times in the behavior lists of other users, this item is regarded as a popular item, and its impact on the user similarity should be penalized. The weight of the i-th item in this paper is calculated as: where freq(i, u) represents the number of times that the i-th item appears in the behavior list of user u; |u| represents the length of the behavior list of user u; |U| represents the total number of users; and popular(i) represents the number of times that the i-th item appears in all of the user behavior lists. Then the weight of the item is introduced into the equation of the Pearson similarity (see Equation (2)), and an improved similarity calculation method is obtained as:
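To make the role of the four quantities concrete, the sketch below computes an item weight from freq(i, u), |u|, |U|, and popular(i). The exact weighting formula is not reproduced here, so the code assumes the textbook TF-IDF form, weight = (freq(i, u) / |u|) · log(|U| / popular(i)); this form, the function name `tfidf_item_weights`, and the data layout are assumptions of this example and may differ from the formula used in the paper.

```python
import math
from collections import Counter


def tfidf_item_weights(behavior_lists):
    """Assumed TF-IDF style weight for every (user, item) pair.

    behavior_lists: dict user -> list of item ids in that user's behavior list.
    weight(u, i) = freq(i, u) / |u| * log(|U| / popular(i))   (assumed form)
    """
    num_users = len(behavior_lists)            # |U|
    popular = Counter()                        # popular(i) over all lists
    for items in behavior_lists.values():
        popular.update(items)
    weights = {}
    for user, items in behavior_lists.items():
        freq = Counter(items)                  # freq(i, u)
        for item, count in freq.items():
            tf = count / len(items)            # len(items) is |u|
            idf = math.log(num_users / popular[item])
            weights[(user, item)] = tf * idf
    return weights


lists = {
    "A": ["hit", "niche"],
    "B": ["hit", "niche"],
    "C": ["hit"],
}
w = tfidf_item_weights(lists)
```

In this toy data the item "hit" appears in every behavior list, so its weight is 0 under the assumed form, while "niche" keeps a positive weight. This is the intended penalty on popular items.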

Improved User Characteristics Model
In real life, people living in the same area tend to have similar lifestyle and eating habits, while people in different areas may show greater differences. Similarly, if two people's characteristics are more similar, such as gender, age, and occupation, then their interests are more likely to be similar. For example, there will be more common topics between students, but students and teachers may have different interests due to their different work and social experiences. Therefore, it is reasonable to recommend a user preference item to other users similar to their characteristics when making recommendations. There are some improved collaborative filtering algorithms, which have used the user's characteristics information. However, there are still some problems in the existing method, for example, the similarity of age and occupation is calculated in a crude way, which makes the recommendation results have some limitations [29].
To deal with these problems, an improved user characteristics similarity model is set up in this paper, which is based on the fuzzy membership method. The proposed user characteristics similarity model can alleviate the cold start problem of the recommendation system caused by the lack of rating data for new users. The user characteristics similarities in this study are defined as follows:
(1) Age similarity. Suppose that if the age difference is less than 5 years, the similarity is regarded as 1, and if the age difference is more than 25 years, the similarity is regarded as 0. The fuzzy membership for the age similarity of users is defined as follows:
(2) Occupation similarity. In the traditional method of occupation similarity calculation, the occupation similarity is set as 1 if the occupations are the same, and as 0 otherwise. Although this can measure the similarity of two users to a certain extent, the users' characteristics are not fully exploited. In this paper, a tree diagram for the classification of occupations is first set up based on the international standard classification of occupations [30], which is shown in Figure 4. In this occupation classification tree, the distance between two nodes is defined as the number of edges between these two nodes. The distance between a parent node and its child node is 1, and the distance between adjacent sibling nodes is 2. The distance between the farthest two occupations in the occupation classification tree is defined as $D_{max}$. Then the fuzzy membership for the occupation similarity of users is defined as follows: where $d_{u,v}$ is the occupation distance between users u and v; and τ is the correction coefficient, which is adjusted dynamically according to the occupation.
(3) Gender similarity. Users of different genders have different preferences for items, so gender should be taken into account when calculating the similarity of user characteristics [29]. Assuming that the gender of user u is $G_u$, and the gender of user v is $G_v$, the gender similarity of users is calculated as:
(4) User characteristics similarity. Combining the characteristics similarities of users in the different dimensions of age, gender, and occupation, the final characteristics similarity of users is calculated as: where α + β + δ = 1 and α, β, δ ∈ (0, 1) are the similarity weights for the user's age, gender, and occupation, respectively. For different recommender systems, these weights can be adjusted dynamically to achieve the best recommendation effect.
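The following sketch shows one way the three characteristic similarities could be combined into a single value. Several parts are assumptions made only for illustration: the age membership is taken to fall linearly from 1 to 0 between age differences of 5 and 25 years, the gender similarity is taken as 1 for equal genders and 0 otherwise, and the occupation similarity is taken as 1 - d/D_max with the correction coefficient τ omitted. The weights default to α = 0.5, β = 0.2, δ = 0.3, the values reported as best in the parameter discussion; all function names are hypothetical.

```python
def age_similarity(age_u, age_v):
    """Fuzzy age membership: 1 up to 5 years apart, 0 from 25 years apart.
    The linear ramp in between is an assumption of this sketch."""
    diff = abs(age_u - age_v)
    if diff <= 5:
        return 1.0
    if diff >= 25:
        return 0.0
    return (25 - diff) / 20


def gender_similarity(gender_u, gender_v):
    """Assumed form: 1 for the same gender, 0 otherwise."""
    return 1.0 if gender_u == gender_v else 0.0


def occupation_similarity(distance, d_max):
    """Assumed form: decreases linearly with the tree distance
    (the correction coefficient tau from the text is omitted)."""
    return 1.0 - distance / d_max


def characteristic_similarity(u, v, occ_distance, d_max,
                              alpha=0.5, beta=0.2, delta=0.3):
    """Weighted sum of age, gender, and occupation similarity."""
    return (alpha * age_similarity(u["age"], v["age"])
            + beta * gender_similarity(u["gender"], v["gender"])
            + delta * occupation_similarity(occ_distance, d_max))


alice = {"age": 24, "gender": "F"}
bob = {"age": 39, "gender": "M"}
sim_ab = characteristic_similarity(alice, bob, occ_distance=2, d_max=4)
```

Because this similarity needs no rating data, it remains available for a new user, which is how the characteristics model helps with the cold start problem.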

Proposed Fusion Strategy to Generate Recommendation
Based on the improved similarity calculation methods above, the final comprehensive user similarity is obtained by weighted fusion, namely:

$$Sim_p(u, v) = \xi \, Sim_{User}(u, v) + \mu \, Sim_{Character}(u, v)$$

where ξ + µ = 1, and ξ, µ ∈ (0, 1) are the weights of the similarity obtained from the TF-IDF method and of the user characteristics similarity, respectively. For different recommender systems, ξ and µ should be optimized. In this paper, a searching algorithm is proposed to obtain the optimal values of ξ and µ, which is shown in Algorithm 1.

After obtaining the comprehensive user similarity $Sim_p(u, v)$, the K users most similar to the target user are selected as the nearest neighbors to form the similar neighbor set of the target user. The rating information of all neighbors is combined with their similarity to the target user to predict the target user's rating on the i-th unknown item. In this study, the rating prediction in (3) is changed to:

$$P_{u,i} = \bar{R}_u + \frac{\sum_{v \in S(u,K) \cap N(i)} Sim_p(u, v)\,(R_{v,i} - \bar{R}_v)}{\sum_{v \in S(u,K) \cap N(i)} |Sim_p(u, v)|}$$

The overall procedure of the proposed collaborative recommendation algorithm is summarized as follows:
• Step 1: Preprocess the rating data and construct the user-item rating matrix R(m × n);
• Step 2: Use the TF-IDF method and the rating data to calculate the user similarity matrix $Sim_{User}(u, v)$;
• Step 3: Use the user characteristics information to calculate the user characteristics similarity matrix $Sim_{Character}(u, v)$;
• Step 4: Fuse the similarity matrices from Step 2 and Step 3 to generate the final comprehensive user similarity matrix $Sim_p(u, v)$;
• Step 5: After the comprehensive similarity matrix is obtained, select the nearest neighbor set of the target user to make rating predictions and generate recommendations.
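The fusion and the search for the fusion weights can be sketched as follows. This is not a reproduction of Algorithm 1; it is a plain grid search under stated assumptions: the two similarities are fused as a weighted sum with µ = 1 - ξ, the search visits ξ in steps of 0.05 inside (0, 1), and `evaluate` is a hypothetical callback that returns a prediction error (for example MAE on held-out ratings) for a given ξ.

```python
def fuse(sim_user, sim_character, xi):
    """Weighted fusion of the two similarities, with mu = 1 - xi."""
    return xi * sim_user + (1 - xi) * sim_character


def search_fusion_weight(evaluate, steps=20):
    """Grid search for xi in (0, 1) with step 1/steps (0.05 by default).

    evaluate: hypothetical callback, xi -> error to be minimized.
    Returns (best_xi, best_mu, best_error).
    """
    best_xi, best_err = None, float("inf")
    for k in range(1, steps):        # excludes the endpoints 0 and 1
        xi = k / steps
        err = evaluate(xi)
        if err < best_err:
            best_xi, best_err = xi, err
    return best_xi, 1 - best_xi, best_err


# Toy error curve with its minimum at xi = 0.8, used only to exercise the search.
def toy_error(xi):
    return (xi - 0.8) ** 2


best_xi, best_mu, best_err = search_fusion_weight(toy_error)
```

In practice `evaluate` would rebuild the fused similarity with `fuse`, run the neighbor-based prediction, and return the resulting error, so each grid point costs one full evaluation pass.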

Dataset and Metrics
To verify the effectiveness of the improved algorithm, this paper uses datasets from the MovieLens recommender system [31]. The MovieLens dataset is a public movie dataset released by the GroupLens Laboratory of the University of Minnesota. At present, there are eight versions with different sizes. The dataset mainly includes the following information: user IDs, item IDs, the users' ratings of the items, and the time stamps of the ratings. The MovieLens-100k (ML-100K) data set and the MovieLens-1M (ML-1M) data set are used in this paper, and the basic information of the two datasets is shown in Table 1. In the experiment, each dataset is randomly divided into a training set and a testing set at a ratio of 8:2 for comparative analysis.

There are many evaluation indexes for recommender systems [32]. Because the ultimate goal of the improved collaborative filtering recommendation algorithm is to improve the accuracy of the recommendation results, this paper mainly considers the accuracy of the algorithm. To evaluate the recommendation accuracy of the improved recommendation algorithm, the root mean square error (RMSE) and the mean absolute error (MAE) are used [33]. MAE and RMSE measure the deviation of the predicted ratings from the true ratings given by the users. The lower the values of RMSE and MAE, the higher the accuracy of the recommendation algorithm. MAE and RMSE are defined as:

$$MAE = \frac{1}{N} \sum_{(u,i)} |r_{u,i} - p_{u,i}|, \qquad RMSE = \sqrt{\frac{1}{N} \sum_{(u,i)} (r_{u,i} - p_{u,i})^2}$$

where N is the total number of ratings to be predicted in the testing set, $r_{u,i}$ represents the actual rating of user u for the i-th item, and $p_{u,i}$ is the predicted rating of user u for the i-th item.
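The two accuracy metrics can be computed as in the following minimal sketch, which uses the standard definitions of MAE and RMSE over paired lists of actual and predicted ratings. The function names and the toy values are chosen for this example only.

```python
import math


def mae(actual, predicted):
    """Mean absolute error over paired actual and predicted ratings."""
    return sum(abs(r - p) for r, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error over paired actual and predicted ratings."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(actual, predicted)) / len(actual))


actual_ratings = [4, 3, 5, 2]
predicted_ratings = [3.5, 3, 4, 3]
toy_mae = mae(actual_ratings, predicted_ratings)    # errors 0.5, 0, 1, 1
toy_rmse = rmse(actual_ratings, predicted_ratings)
```

RMSE squares each error before averaging, so it penalizes large deviations more strongly than MAE, which is why the two values differ on the same predictions.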

Comparison Experiment
To evaluate the performance of the proposed algorithm (ICFTU), the traditional user-based collaborative filtering algorithm (UCF), the collaborative filtering algorithm with user characteristics (CFUC), the collaborative filtering algorithm based on clustering (K-MCF), and the algorithm based on optimizing similarity calculation (ICFOS) [34] are selected for comparison. These four algorithms are classical and often used in recommender systems. The five algorithms are trained on the training sets of the two data sets, and rating prediction is carried out on the test sets to compare the MAE and RMSE values of the different algorithms. The number of nearest neighbors K is set to 35 for all of the algorithms used in this experiment. The comparison of recommendation accuracy is shown in Table 2 and Figure 5. It can be seen from the figure that the ICFTU proposed in this paper has a better recommendation effect on both datasets. The traditional user-based collaborative filtering algorithm UCF has the largest error and the lowest prediction accuracy. The CFUC method combines the users' characteristic information, which makes up for some defects of the traditional algorithm and improves the accuracy of recommendation. The K-MCF algorithm based on clustering and the ICFOS algorithm based on optimizing similarity calculation both improve the accuracy of recommendation to a certain extent. The ICFTU algorithm combines the TF-IDF method and user characteristics, reduces the impact of popular items on user similarity, improves the calculation of user characteristics similarity, improves the accuracy of recommendation, and still has some advantages on the larger data set.

Parameter Discussion
In this section, the influence of the parameters involved in the algorithm mentioned in this paper will be discussed. The experiments are carried out on ML-100k data set, where about 20,000 rating data are used to test the influence of the parameters in the proposed algorithm.
(1) About the nearest neighbor. First, a reasonable number of nearest neighbors K is discussed, which is one of the key factors for the recommendation algorithm to achieve good results. The MAE and RMSE of the proposed method under different K are shown in Figure 6. The results in Figure 6 show that as the number of nearest neighbors of the target user increases, the MAE and RMSE of the algorithm show a downward trend and gradually saturate. Therefore, K is set to 35 in this study, to keep a relatively high accuracy at a low computational cost.
(2) About the user characteristic parameters. Secondly, the user characteristic parameters α, β, and δ are discussed. In this experiment, the parameter adjustment step is set to 0.1. Because α + β + δ = 1, and to keep all three parameters greater than 0, each parameter is taken from the interval [0.1, 0.8]. The MAE and RMSE of the proposed method under different α, β, and δ are shown in Figure 7. It can be seen that the MAE and RMSE are at their minimum when α = 0.5, β = 0.2, and δ = 0.3.
(3) About the model fusion parameters. Thirdly, the similarity weighted fusion parameters ξ and µ are discussed. In this experiment, the parameter adjustment step is set to 0.05. The MAE and RMSE of the proposed method under different ξ and µ are shown in Figure 8. It can be seen that the MAE and RMSE are at their minimum when ξ = 0.8 and µ = 0.2. This indicates that the user similarity calculated by the TF-IDF method has the main influence on the recommendation algorithm, and that the recommendation effect can be improved by properly fusing the user characteristics similarity.
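The candidate settings for the user characteristic parameters in item (2) can be enumerated as in the sketch below. It assumes a plain exhaustive grid: all triples (α, β, δ) on a 0.1 step with each value in [0.1, 0.8] and α + β + δ = 1. Integer tenths are used internally to avoid floating-point drift; the function name `weight_grid` is hypothetical.

```python
def weight_grid(steps=10, low=1, high=8):
    """All (alpha, beta, delta) triples on a 1/steps grid that sum to 1,
    with every weight between low/steps and high/steps inclusive."""
    triples = []
    for a in range(low, high + 1):
        for b in range(low, high + 1):
            d = steps - a - b            # delta is fixed by the sum constraint
            if low <= d <= high:
                triples.append((a / steps, b / steps, d / steps))
    return triples


grid = weight_grid()
```

Under these constraints the grid contains 36 candidate triples, including (0.5, 0.2, 0.3), so an exhaustive evaluation of MAE and RMSE over all of them is inexpensive.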

Ablation Experiment
In this paper, an improved collaborative filtering recommendation algorithm based on the TF-IDF model and the user characteristics model is proposed. To discuss the influence of the two main improvements of the proposed method, two ablation experiments are carried out. The experiments are carried out on the ML-100k data set, and the results of the proposed method (ICFTU) in Section 4.2 are used as reference. The method which is only based on the TF-IDF model is called ICFTU-TI, and the method which is only based on the proposed user characteristics model is called ICFTU-UC. The results of this ablation experiment are shown in Tables 3 and 4 and Figure 9. The results show that: (1) The ICFTU algorithm has the best performance and the smallest error, which shows that using both the TF-IDF model and the improved user characteristics model is effective; (2) The error of ICFTU-TI is smaller than that of ICFTU-UC, and it is close to that of ICFTU. This shows that the TF-IDF method is the main factor in improving the accuracy of the model, and the improved user characteristics model is the secondary factor, which is consistent with the discussion of the similarity weighted fusion parameters ξ and µ in Section 5.1.

Compared with Other Methods
To further show the performance of the proposed collaborative recommendation algorithm (ICFTU), it is compared with two other state-of-the-art recommendation approaches. The first one is a collaborative filtering framework based on a gauss core and an extension classification method (known as GCEDA) [26]. The second one is an approach based on Deep Feed-Forward Neural Networks (known as DFFN) [35]. This comparison experiment is carried out on the ML-100k data set, and the results are shown in Table 5 and Figure 10. The results show that the proposed ICFTU has a better recommendation effect than GCEDA and DFFN. The performance of the GCEDA and DFFN methods is close to that of the K-MCF method (see Tables 2 and 5); the main reason is that all three methods use the information of the input data, although with different strategies. However, the comprehensive performance of the proposed model is the best. The MAE value of the proposed model is 3.50% and 1.03% lower than that of GCEDA and DFFN, respectively. Meanwhile, the RMSE value of the proposed model is 3.88% and 1.88% lower than that of GCEDA and DFFN, respectively.

Conclusions
To reduce the impact of popular items on user similarity, the TF-IDF statistical method is used in this paper, and its formula is adapted to the recommendation model. At the same time, an improved user characteristics similarity calculation method is proposed, which makes use of the user characteristics information and alleviates the cold start problem. Finally, off-line experiments are conducted on the MovieLens data sets. Experimental results show that the proposed algorithm is more accurate than the algorithms it is compared with. Some problems should be further studied in the future, such as new user similarity models that fuse item tags and user characteristics, and the use of deep learning technology to mine the latent information of users and items.