1. Introduction
With the advent of the big data era, information on the Internet has grown exponentially, and people have moved from an age of information scarcity to one of information explosion. However, most of this information is of little value to any given user, and its sheer volume makes it increasingly difficult to find valuable information on the Internet [1]. To improve the efficiency of production and daily life, people need information filtering technologies that screen out useless information. Recommender systems are software tools and techniques that suggest items likely to be useful to a user. As an effective information filtering tool, a personalized recommendation system can help users efficiently obtain information that meets their needs, even when those needs are unclear [2].
The core of a personalized recommendation system is the recommendation algorithm. The main families are the content-based recommendation algorithm, the collaborative filtering recommendation algorithm, and the hybrid recommendation algorithm [3,4]. Among them, the collaborative filtering recommendation algorithm has become one of the most effective and widely applied because of its efficiency, accuracy, and personalization [5]. For example, Nakagawa and Ito [6] proposed a recommendation system that uses collaborative filtering to recommend interesting document files to users. Yu et al. [7] presented an application of a collaborative filtering algorithm in the field of e-commerce. Park et al. [8] presented a fast collaborative filtering algorithm with a k-nearest neighbor graph. Wu et al. [9] used a collaborative filtering algorithm to improve the prediction accuracy of large-scale recommendation systems. Bartolini et al. [10] implemented a personalized recommendation. Although the collaborative filtering algorithm has been widely used, it still suffers from problems such as data sparsity, cold start, and information expiration [11].
To solve these problems, a series of improvements to the traditional collaborative filtering algorithm have been made, with some success. For example, Piraste et al. [12] alleviated the sparsity and cold start problems of the rating matrix using the film type label and director genre. Kumar et al. [13] used matrix decomposition to reduce the dimension of the matrix and improve the accuracy of the recommendation results. Sun and Dong [14] proposed a dynamic time drift model that considers how changes in user interest over different time periods affect similarity. Wang et al. [15] proposed a collaborative filtering algorithm combining the KNN model and the XGBoost model. Zarzour et al. [16] presented a new, effective model-based trust collaborative filtering method to improve the quality of recommendation. In addition, there are collaborative filtering algorithms based on clustering [17], neural networks [18], and various probability models [19]. These studies optimized the recommendation model to a certain extent and improved the accuracy of the recommendation results, but some problems remain to be studied. For example, most existing collaborative filtering algorithms consider only the rating information of users and ignore both the user characteristics and the impact of popular items on user similarity, which leads to poor recommendation results.
To further improve recommendation accuracy, this paper proposes a collaborative filtering algorithm based on the TF-IDF method and user characteristics. The proposed method takes both the rating information and the characteristics of users into account. The contributions of this paper can be summarized as follows: (1) Based on the rating data, the TF-IDF method is used to calculate the user similarity matrix, which penalizes the impact of popular items on user similarity and improves the ability to mine unpopular items. (2) User characteristics are used to calculate a user similarity based on fuzzy membership functions, which addresses the cold start problem by combining the characteristics information of users from different dimensions. (3) An adaptive weighting algorithm is presented to fuse the two kinds of user similarity obtained in the above two steps into a new comprehensive user similarity for the recommendation algorithm. Finally, experiments are carried out on real data sets to evaluate the accuracy of the proposed recommendation model. The experimental results show that the proposed algorithm achieves higher accuracy than the state-of-the-art algorithms.
This paper is organized as follows. Section 2 gives an overview of related work. The proposed algorithm is presented in Section 3. Section 4 provides the experiments and an analysis of the results. The parameters and performance of the proposed algorithm are discussed in Section 5. Section 6 gives the conclusions.
2. Related Work
The basic idea of the collaborative filtering algorithm can be summarized as recommending to a target user the items favored by users with similar interests [20,21]. As shown in Figure 1, the collaborative filtering algorithm consists of three main steps: establishing the user-item rating matrix, finding other users whose interests are similar to those of the target user, and predicting ratings based on these similar users to make recommendations. Traditional collaborative filtering (CF) algorithms are mainly divided into user-based collaborative filtering (UCF) and item-based collaborative filtering (ICF) (see Figure 2). Many improvements to collaborative filtering recommendation algorithms have been proposed to address the data sparsity and cold start problems, and these existing methods provide a good research basis for recommendation systems.
This paper focuses on the user-based collaborative filtering algorithm, which is better suited to reflecting the favorite items of groups with similar interests and produces more social recommendation results. The proposed collaborative recommendation algorithm is related to existing TF-IDF-based methods, but it differs from them in several respects. In the proposed method, the TF-IDF method is applied to rating data, and user characteristics are fused in to optimize the user similarity and improve the accuracy of rating prediction. This differs from methods that use a time-dependent similarity measure to compute the user similarity without considering user characteristics [22], and from methods that calculate the user similarity directly through the TF-IDF method [23].
The user-based collaborative filtering algorithm first calculates the similarity between the target user and the other users. Then, the users with the highest similarity are selected as the nearest neighbor set. Finally, the ratings of the target user are predicted for the items rated by the neighbor set. The main process of the traditional user-based collaborative filtering algorithm is described as follows.
2.1. Data Preprocessing
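Suppose the data set of a recommender system is $D$, where $D = \{U, I, R\}$, $U = \{u_1, u_2, \ldots, u_m\}$ is the user set of the system, $I = \{i_1, i_2, \ldots, i_n\}$ is the item set of the system, and $R$ is a user-item rating matrix. For a data set with $m$ users and $n$ items, the data are preprocessed to obtain an $m \times n$ user-item rating matrix $R$, which is shown as follows:

$$R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix} \tag{1}$$

where $r_{ij}$ represents the rating data of user $u_i$ for item $i_j$.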
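As an illustration of this preprocessing step, the following minimal sketch builds the rating matrix from a list of rating records. The triple format `(user index, item index, rating)`, the function name `build_rating_matrix`, and the use of NaN for missing ratings are assumptions made for the example, not part of the original method.

```python
import numpy as np

def build_rating_matrix(ratings, n_users, n_items):
    """Build an m x n user-item matrix; unrated entries are stored as NaN."""
    R = np.full((n_users, n_items), np.nan)
    for user, item, rating in ratings:
        R[user, item] = rating
    return R

# Hypothetical (user index, item index, rating) records.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0), (2, 1, 2.0)]
R = build_rating_matrix(ratings, n_users=3, n_items=3)
```

Storing missing ratings as NaN rather than 0 keeps "not rated" distinct from a low rating, which matters when user averages are computed later.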
2.2. Similarity Calculation
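In recommender systems, three main methods are used to calculate the similarity between two users: the cosine similarity, the adjusted cosine similarity, and the Pearson similarity [24]. In this study, the Pearson similarity is used, which is calculated over the items rated by both users. The Pearson similarity is shown as follows:

$$\mathrm{sim}(u,v) = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}} \tag{2}$$

where $I_{uv}$ is the set of items rated by both user $u$ and user $v$; $r_{u,i}$ and $r_{v,i}$ represent the ratings of user $u$ and user $v$ on the $i$-th item, respectively; and $\bar{r}_u$ and $\bar{r}_v$ represent the average of all the ratings of user $u$ and user $v$, respectively.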
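The Pearson similarity described above can be sketched as follows. The sketch assumes that each user is a NumPy vector with NaN marking unrated items, and that a similarity of 0 is returned when two users share no rated items or when the denominator vanishes; both conventions are assumptions for illustration.

```python
import numpy as np

def pearson_similarity(r_u, r_v):
    """Pearson similarity over co-rated items; NaN marks a missing rating."""
    common = ~np.isnan(r_u) & ~np.isnan(r_v)
    if not common.any():
        return 0.0
    # User means are taken over all ratings of each user, as in the text.
    du = r_u[common] - np.nanmean(r_u)
    dv = r_v[common] - np.nanmean(r_v)
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom > 0 else 0.0

# Hypothetical rating vectors for three users over four items.
u = np.array([5.0, 3.0, np.nan, 1.0])
v = np.array([4.0, 2.0, 5.0, np.nan])
w = np.array([1.0, 5.0, np.nan, 3.0])
```

Here `pearson_similarity(u, w)` is negative because the two users deviate from their own averages in opposite directions on the items they share.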
2.3. Generate Recommendation Set
Before ratings can be predicted and recommendations generated, the similar neighbor set of the target user must be determined. A similar neighbor set is the set of users whose preferences are similar to those of the target user. In a recommendation system, the K most similar users are usually selected as the nearest neighbors that form the similar neighbor set of the target user [25].
After the neighbor set of the target user is selected, the neighbors' ratings of the items are combined with the similarities between the users to predict the target user's ratings on the test set. The rating prediction is calculated as follows [26]:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_u \cap U_i} \mathrm{sim}(u,v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N_u \cap U_i} \left|\mathrm{sim}(u,v)\right|} \tag{3}$$

where $\hat{r}_{u,i}$ represents the predicted rating of user $u$ for unknown item $i$; $\bar{r}_u$ and $\bar{r}_v$ are the average ratings of user $u$ and user $v$; $N_u$ is the set of the $K$ users most similar to user $u$; and $U_i$ represents the set of users who have rated item $i$. After rating prediction, the $N$ items with the highest predicted ratings are selected as the recommendation results for the target user, and the recommendation process ends [27].
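The neighbor selection, rating prediction, and top-N selection described above can be sketched as follows. The sketch assumes a mean-centered weighted average over the neighbors who rated the item, a precomputed similarity matrix `sim`, and a fallback to the user's own average when no neighbor has rated the item; the function names and the toy data are hypothetical.

```python
import numpy as np

def predict_rating(R, sim, u, i, k=2):
    """Mean-centered K-nearest-neighbor prediction of user u's rating on item i."""
    mean_u = np.nanmean(R[u])
    # The K users most similar to u, excluding u itself.
    neighbors = [v for v in np.argsort(-sim[u]) if v != u][:k]
    # Keep only the neighbors who actually rated item i.
    raters = [v for v in neighbors if not np.isnan(R[v, i])]
    denom = sum(abs(sim[u, v]) for v in raters)
    if denom == 0:
        return float(mean_u)  # assumed fallback: the user's own average
    num = sum(sim[u, v] * (R[v, i] - np.nanmean(R[v])) for v in raters)
    return float(mean_u + num / denom)

def top_n(R, sim, u, n=2, k=2):
    """Indices of the n unrated items with the highest predicted rating."""
    unrated = [i for i in range(R.shape[1]) if np.isnan(R[u, i])]
    return sorted(unrated, key=lambda i: predict_rating(R, sim, u, i, k),
                  reverse=True)[:n]

# Hypothetical ratings (NaN = unrated) and a hypothetical similarity matrix.
R = np.array([[5.0, 3.0, np.nan, np.nan],
              [4.0, 2.0, 5.0, 1.0],
              [1.0, 5.0, 2.0, 4.0]])
sim = np.array([[1.0, 0.9, -0.5],
                [0.9, 1.0, -0.4],
                [-0.5, -0.4, 1.0]])
recommended = top_n(R, sim, u=0)
```

A practical system would typically also clip predictions to the rating scale; that step is omitted here to keep the sketch close to the formula.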
3. Proposed Method
As introduced in Section 2, traditional user-based collaborative filtering algorithms usually use only the rating information of users and ignore both other aspects of user information and the impact of popular items on user similarity. To deal with these problems, an improved collaborative recommendation algorithm (referred to as ICFTU) is proposed by combining the Term Frequency-Inverse Document Frequency (TF-IDF) method with a user characteristics model. The overall framework of the proposed algorithm is shown in Figure 3. It has three main parts: the improved TF-IDF-based method, the improved user characteristics model, and the proposed fusion strategy. The proposed method is presented as follows.
3.1. Improved TF-IDF Based Method
The traditional collaborative filtering algorithm calculates the user similarity matrix from the users' rating data of items, which is easily affected by popular items. For example, “Shawshank Redemption” is a very good movie. If user A and user B both gave “Shawshank Redemption” 5 points, the traditional collaborative filtering algorithm concludes that user A and user B are highly similar. However, this is not necessarily the case: the same behavior of users on popular items does not mean that they have similar interests. On the contrary, if two users behave the same way on unpopular items, it is more likely that their interests are similar. For example, if users A and B have both watched movies from a relatively niche category, such as musicals, they can be considered to have similar interests. Therefore, to reduce the impact of popular items on the user similarity, the Term Frequency-Inverse Document Frequency (TF-IDF) method is applied to the traditional collaborative filtering algorithm in this paper to penalize the popular items in the user behavior list. The TF-IDF method is chosen mainly because it is well suited to weight extraction problems and because it is simple and easy to calculate [28].
TF-IDF is a statistical method that is often used to evaluate the importance of a word to a document. The importance of a word is directly proportional to the number of times it appears in the document, but inversely proportional to the frequency with which it appears in the whole document collection [28]. Based on the principle of TF-IDF, an improved user similarity calculation method is proposed to reduce the weight of popular items in the user similarity. If an item appears in a user's behavior list but also appears many times in the behavior lists of other users, the item is regarded as a popular item, and its impact on the user similarity should be penalized. The weight of the $i$-th item in this paper is calculated as:

$$w_{u,i} = \frac{n_{u,i}}{L_u} \times \log\frac{N}{N_i} \tag{4}$$

where $n_{u,i}$ represents the number of times that the $i$-th item appears in the behavior list of user $u$; $L_u$ represents the length of the behavior list of user $u$; $N$ represents the total number of users; and $N_i$ represents the number of times that the $i$-th item appears in all of the user behavior lists.
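The item weighting described above can be illustrated with a short sketch. It assumes the plain TF-IDF product (term frequency within the user's behavior list times the logarithm of the total number of users over the item's occurrence count), the natural logarithm, and no smoothing term; the data and the function name `tfidf_item_weights` are hypothetical.

```python
import math
from collections import Counter

def tfidf_item_weights(behaviors):
    """behaviors: dict mapping each user to the list of items they interacted with."""
    n_users = len(behaviors)
    # Occurrences of each item across all user behavior lists.
    item_count = Counter(i for items in behaviors.values() for i in items)
    weights = {}
    for user, items in behaviors.items():
        tf = Counter(items)
        weights[user] = {
            # ASSUMPTION: unsmoothed TF x IDF with the natural logarithm.
            i: (tf[i] / len(items)) * math.log(n_users / item_count[i])
            for i in tf
        }
    return weights

# Hypothetical behavior lists: every user has seen the "popular" item.
behaviors = {
    "A": ["popular", "musical"],
    "B": ["popular", "musical"],
    "C": ["popular", "western"],
    "D": ["popular"],
}
weights = tfidf_item_weights(behaviors)
```

In this toy data the item seen by every user receives weight 0, while the rarer items keep a positive weight, which is the penalizing effect the text describes.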
Then the weight of the item is introduced into the Pearson similarity (see Equation (2)), and an improved similarity calculation method is obtained as:
3.2. Improved User Characteristics Model
In real life, people living in the same area tend to have similar lifestyles and eating habits, while people in different areas may differ more. Similarly, the more alike two people are in characteristics such as gender, age, and occupation, the more likely their interests are to be similar. For example, students tend to share more topics with each other, whereas students and teachers may have different interests because of their different work and social experiences. Therefore, when making recommendations, it is reasonable to recommend the items a user prefers to other users with similar characteristics. Some improved collaborative filtering algorithms already use user characteristics information. However, problems remain in the existing methods. For example, the similarity of age and occupation is calculated in a crude way, which limits the quality of the recommendation results [29].
To deal with these problems, an improved user characteristics similarity model based on the fuzzy membership method is set up in this paper. The proposed user characteristics similarity model can alleviate the cold start problem of the recommendation system caused by the lack of rating data for new users. The user characteristics similarities in this study are defined as follows:
(1) Age similarity
Suppose that the similarity is regarded as 1 if the age difference between two users is less than 5 years, and as 0 if the age difference is more than 25 years. The fuzzy membership for the age similarity of users is defined as follows:
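One way to realize the two thresholds above is a piecewise-linear membership function. The sketch below is only an illustration: the linear decrease between the thresholds, the function name `age_similarity`, and the parameter names are assumptions, and the membership function used in this paper may have a different shape between 5 and 25 years.

```python
def age_similarity(age_u, age_v, full=5, zero=25):
    """Fuzzy age similarity: 1 below `full` years apart, 0 above `zero` years apart."""
    diff = abs(age_u - age_v)
    if diff <= full:
        return 1.0
    if diff >= zero:
        return 0.0
    # ASSUMPTION: linear decrease between the two thresholds.
    return (zero - diff) / (zero - full)
```

For example, under this assumed shape two users aged 20 and 35 (15 years apart) lie halfway between the thresholds.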
(2) Occupation similarity
The traditional method sets the occupation similarity to 1 if two users have the same occupation and to 0 otherwise. Although this measures the similarity of two users to a certain extent, it does not fully exploit the user characteristics. In this paper, a tree diagram for the classification of occupations is first set up based on the International Standard Classification of Occupations [30], as shown in Figure 4.
In this occupation classification tree, the distance between two nodes is defined as the number of edges between these two nodes. The distance between the parent node and child node is 1, and the distance between adjacent brother nodes is 2. The distance between the farthest two occupations in the occupation classification tree is defined as
. Then the fuzzy membership for the occupation similarity of users is defined as follows:
where
is the occupation distance between users
A and
B; and
is the correction coefficient, which is adjusted dynamically according to the occupation.
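The tree distance described above (the number of edges between two nodes) can be computed by walking both nodes up to their lowest common ancestor. In the sketch below, the small occupation tree, the child-to-parent dictionary representation, and the linear membership `1 - d / d_max` are assumptions for illustration; in particular, the correction coefficient mentioned in the text is omitted.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of edges on the path between nodes a and b."""
    depth_a = {n: d for d, n in enumerate(path_to_root(a, parent))}
    for d_b, n in enumerate(path_to_root(b, parent)):
        if n in depth_a:  # lowest common ancestor reached
            return depth_a[n] + d_b
    raise ValueError("nodes are not in the same tree")

def occupation_similarity(a, b, parent, d_max):
    # ASSUMPTION: linear membership; the correction coefficient is left out.
    return 1.0 - tree_distance(a, b, parent) / d_max

# Hypothetical occupation tree given as child -> parent.
parent = {
    "education": "all", "health": "all",
    "teacher": "education", "student": "education",
    "doctor": "health",
}
d_max = 4  # farthest pair in this toy tree, e.g. teacher and doctor
```

With this toy tree, a parent and its child are at distance 1 and two sibling nodes are at distance 2, matching the definition in the text.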
(3) Gender similarity
Users of different genders have different preferences for items, so gender should be taken into account when calculating the similarity of user characteristics [29]. Assuming that the gender of user $u$ is $g_u$ and the gender of user $v$ is $g_v$, the gender similarity of users is calculated as:

$$\mathrm{sim}_{\mathrm{gender}}(u,v) = \begin{cases} 1, & g_u = g_v \\ 0, & g_u \neq g_v \end{cases}$$
(4) User characteristics similarity
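Combining the similarities of users in the different dimensions of age, gender, and occupation, the final characteristics similarity of users is calculated as:

$$\mathrm{sim}_{\mathrm{char}}(u,v) = \alpha\,\mathrm{sim}_{\mathrm{age}}(u,v) + \beta\,\mathrm{sim}_{\mathrm{gender}}(u,v) + \gamma\,\mathrm{sim}_{\mathrm{occ}}(u,v)$$

where $\mathrm{sim}_{\mathrm{age}}$, $\mathrm{sim}_{\mathrm{gender}}$, and $\mathrm{sim}_{\mathrm{occ}}$ are the age, gender, and occupation similarities defined above, and $\alpha$, $\beta$, and $\gamma$ are the similarity weights for the user's age, gender, and occupation, respectively. For different recommender systems, these weights can be adjusted dynamically to achieve the best recommendation effect.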
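The combination of the per-dimension similarities can be sketched as a weighted sum. The default weight values below, the binary gender rule, and the function names are assumptions chosen for illustration; the text only states that the weights are tuned per recommender system.

```python
def gender_similarity(g_u, g_v):
    # ASSUMPTION: binary rule, 1 for the same gender and 0 otherwise.
    return 1.0 if g_u == g_v else 0.0

def characteristics_similarity(s_age, s_gender, s_occ, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the age, gender, and occupation similarities."""
    w_age, w_gender, w_occ = weights  # hypothetical weights, tuned in practice
    return w_age * s_age + w_gender * s_gender + w_occ * s_occ

# Two hypothetical users: same age band, different gender, sibling occupations.
s = characteristics_similarity(1.0, gender_similarity("F", "M"), 0.5)
```

Because this similarity needs no rating data, it can be computed for a brand-new user, which is how the characteristics model helps with cold start.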
3.3. Proposed Fusion Strategy to Generate Recommendation
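Based on the improved similarity calculation methods above, the final comprehensive user similarity is obtained by weighted fusion, namely:

$$\mathrm{sim}_{\mathrm{com}}(u,v) = \lambda_1\,\mathrm{sim}_{\mathrm{tfidf}}(u,v) + \lambda_2\,\mathrm{sim}_{\mathrm{char}}(u,v)$$

where $\mathrm{sim}_{\mathrm{tfidf}}(u,v)$ is the similarity obtained with the TF-IDF method, $\mathrm{sim}_{\mathrm{char}}(u,v)$ is the user characteristics similarity, and $\lambda_1$ and $\lambda_2$ represent the weights of these two similarities. For different recommender systems, $\lambda_1$ and $\lambda_2$ should be optimized. In this paper, a search algorithm is proposed to obtain the optimal values of $\lambda_1$ and $\lambda_2$, which is shown in Algorithm 1: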
Algorithm 1. Optimal solution search algorithm (the listing comprises parameter initialization, model fusion, and parameter update steps).
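A search over the fusion weights can be sketched as a simple grid search. The sketch below is not a transcription of Algorithm 1: it assumes that the two weights sum to 1, that the search uses a fixed step, and that a hypothetical `evaluate` callback returns a validation error (for example, the MAE of the resulting predictions) for a fused similarity matrix.

```python
import numpy as np

def search_fusion_weight(sim_rating, sim_char, evaluate, step=0.1):
    """Grid search for the weight of the rating-based similarity."""
    best_alpha, best_err = 0.0, float("inf")       # parameter initialization
    for alpha in np.arange(0.0, 1.0 + step / 2, step):
        # Model fusion. ASSUMPTION: the two weights sum to 1.
        fused = alpha * sim_rating + (1.0 - alpha) * sim_char
        err = evaluate(fused)
        if err < best_err:                          # parameter update
            best_alpha, best_err = float(alpha), err
    return best_alpha, best_err

# Hypothetical similarity matrices and a stand-in error function.
A = np.array([[1.0, 0.2], [0.2, 1.0]])
B = np.array([[1.0, 0.8], [0.8, 1.0]])
target = 0.7 * A + 0.3 * B
alpha, err = search_fusion_weight(A, B, lambda S: float(np.abs(S - target).mean()))
```

In a real experiment, `evaluate` would run the rating prediction on held-out data; the stand-in used here only checks that the search recovers a known mixing weight.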
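After the comprehensive user similarity $\mathrm{sim}_{\mathrm{com}}(u,v)$ is obtained, the $K$ users most similar to the target user are selected as the nearest neighbors to form the similar neighbor set of the target user. The rating information of all neighbors is then combined with their similarity to the target user to predict the target user's rating on the $i$-th unknown item. In this study, the rating prediction in Equation (3) is changed to:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_u \cap U_i} \mathrm{sim}_{\mathrm{com}}(u,v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N_u \cap U_i} \left|\mathrm{sim}_{\mathrm{com}}(u,v)\right|}$$

where $N_u$ is the set of the $K$ users most similar to user $u$ under the comprehensive similarity, $U_i$ is the set of users who have rated item $i$, and $\bar{r}_u$ and $\bar{r}_v$ are the average ratings of user $u$ and user $v$.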
The total work processing of the proposed collaborative recommendation algorithm is summarized as follows:
Step 1: Preprocess the rating data and construct the user-item rating matrix R;
Step 2: Use the TF-IDF method and the rating data to calculate the user similarity matrix;
Step 3: Use the user characteristics information to calculate the user characteristics similarity matrix;
Step 4: Fuse the similarity matrices from Step 2 and Step 3 to generate the final comprehensive user similarity matrix;
Step 5: After the comprehensive similarity matrix of a user is obtained, the nearest neighbor set of the target user is selected to make rating prediction and generate recommendations.