Movie Recommendation through Multiple Bias Analysis

Abstract: A recommender system (RS) is an agent that recommends items suitable for users, and it is commonly implemented through collaborative filtering (CF). CF based on matrix factorization (MF) has a limitation in improving recommendation accuracy. Therefore, a new method is required for analyzing preference patterns that existing studies could not derive. This study aimed at solving these problems through bias analysis. By analyzing user and item biases in user preferences, the bias-based predictor (BBP) was developed and shown to outperform memory-based CF. In this paper, to enhance BBP, multiple bias analysis (MBA) is proposed to efficiently reflect real-world decision-making. Experimental results using movie data revealed that MBA enhanced BBP accuracy and that the hybrid models outperformed MF and SVD++. Based on these results, MBA is expected to improve performance when used as a component of related systems and to provide useful knowledge in any area that needs features that can represent users.


Research Flow of Recommendation System
Search engines developed thanks to the spread of the Internet, and the online transaction market has grown into a digital content market thanks to the spread of smart devices. Consequently, users now incur greater costs to find information. Information retrieval (IR) [1] is a field that searches for documents suited to users' information needs, and recommender systems (RSs) are an application field derived from IR. RS development has followed a process similar to that of IR, since RSs adopted and applied IR techniques.
The main issues of CF include cold start [15,16], scalability [2,6,17], and accuracy. Cold start refers to the problem in which no recommendation is possible because there is no record of decision-making for new users or new items: no items can be recommended to new users, and new items cannot be recommended to any user. Content-based CF is a representative model for solving cold-start problems; it enables recommendations through metadata analysis even without a record of decision-making.
Scalability is a complex problem occurring in the real world, entailing difficulty in applying a small-scale research model to production operations. This problem is mitigated by simplifying the CF algorithm or processing it in parallel. In general, studies [2,6] have been conducted on executing CF algorithms in distributed/parallel processing environments.

Research Motivation
Users' ratings (decision-making) of movies are determined by the effects of various elements, such as the differences in the degree to which users give high or low scores on average, users' tastes, cinematic quality, popularity, etc. The traditional CF principle is to find correlations in the mathematical patterns appearing in ratings and to search for similar movies. The attributes of users and items are critical because individual subjectivity is reflected in movie selection. Because users' tastes differ by age, gender, occupation, genre, director, and actor, the probability of selecting the same movies differs across users. Elements such as enthusiasm for a certain genre or series, or being a fan of a director or actor, are reflected in movie selection.
Bias can mean leaning to one side, the intercept b in y = ax + b, distortion [29], custom [30], taste [31], or individual subjectivity [32,33]. Because bias exists in all real-world data, it is computed in data mining and machine learning for model optimization. For example, some newspapers write articles leaning to one side, and sample groups participating in a questionnaire survey, as well as the processes of reducing those sample groups, are affected by bias. Additionally, for content (newspaper articles, music, movies, videos, webtoons, etc.), people's bias is reflected in their ratings or reviews, and also in their evaluations of other users' ratings or reviews. Although positive and negative views of bias conflict with each other, bias is one of the places where users' preferences are most clearly expressed.
The Bias-Based Predictor (BBP) [32] is a method that reflects the foregoing to analyze bias. An appropriate bias is an average computed from a fair criterion (the mediator). BBP proposed a method for finding such mediators and outperformed memory-based CF by analyzing user and item biases. However, because real-world decision-making reflects more complex relationships than BBP captures, BBP should be expanded to analyze more relationships [33].
Multiple Bias Analysis (MBA) analyzes users' preference patterns centering on bias to find new clues undiscovered by existing CF methods, and tries to understand users' decision-making processes using metadata. A bias can be rational to the user himself/herself, but not to other users. For example, evaluations of newspaper articles (comments) and postings on portal sites (blogs) may happen to resonate with users who are similar to the author, but in other cases they amount to forcing a view on others. In particular, because the nature of bias is clearly distinguished in social and political issues, bias cannot be justified to others. Furthermore, the issue of dataset fairness should be considered together. The propensity of a certain portal site can be affected by the users constituting it, and a model learned from such data can produce unfair results for certain users. Additionally, contrary to the intention, a distorted result may occur while reducing the data size. MBA used BBP as a preceding study to consider these elements; BBP searches for pinpoint scores to alleviate this problem. This paper proposes MBA, which analyzes more relationships than BBP using metadata, and a hybrid model that combines MBA and MF. The core of MBA is to find features that can represent users, and feature bias was analyzed in two ways: sympathized with other users, or independent of them. Experimental results using movie data showed 9.03%, 8.20%, 7.23%, 4.68%, and 0.97% higher accuracy for MBA than UBCF, IBCF, BBP, MF, and SVD++, respectively. Section 2 reviews MF and BBP, Section 3 introduces MBA, Section 4 presents the experimental results, and Section 5 presents conclusions.

MF: Matrix Factorization
MF [8,10] is a type of model-based CF. It became widely known through the 2009 Netflix Prize and has been considered among the best models to date. Although SVD [9,10] is used because MF can decompose only square matrices, the two terms MF and SVD are often used interchangeably.
When a rating matrix R_{m×n} is given for m users and n items, the core of learning is decomposing the original matrix R into several f-dimensional matrices (MF: p_u, q_i ∈ R^f; SVD: U, Σ, V ∈ R^f) and ensuring that the product of the decomposed matrices approximates the original matrix. Equation (1) is the rating prediction, Equation (2) is the objective function, and the model is learned through Equations (3)-(5). p_u is a user's latent factor (user vector), q_i is an item's latent factor (item vector), λ is the regularization weight, and γ is the learning rate.
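As a concrete illustration of Equations (1)-(5), the factorization can be sketched with stochastic gradient descent as below. The hyperparameter values, the random initialization, and the function names are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def train_mf(ratings, n_users, n_items, f=20, gamma=0.02, lam=0.02,
             epochs=200, seed=0):
    """Minimal SGD matrix factorization: learn p_u, q_i with p_u . q_i ~ r(u, i)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, (n_users, f))  # users' latent factors p_u
    Q = rng.normal(0.0, 0.1, (n_items, f))  # items' latent factors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]            # prediction error against r(u, i)
            pu = P[u].copy()                 # keep old p_u for the q_i step
            P[u] += gamma * (err * Q[i] - lam * P[u])  # gradient step on p_u
            Q[i] += gamma * (err * pu - lam * Q[i])    # gradient step on q_i
    return P, Q

def predict_mf(P, Q, u, i):
    # Equation (1)-style prediction: inner product of the latent factors
    return float(P[u] @ Q[i])
```

On a toy rating list the learned factors reproduce the observed scores closely, which is all this sketch is meant to show.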
SVD++ [8,10] is a model in which bias and a temporal score [31] were added to MF [9]. Equation (6) is the rating prediction, Equation (7) is the objective function, and the model is learned through Equations (8)-(13).
MF and SVD++ are implemented and distributed in many open APIs. In this study, the experiments on MF and SVD++ were conducted using MyMediaLite [34].

BBP: Bias-Based Predictor
BBP [32] analyzes preferences based on the viewpoint of bias, assuming that bias and preferences are closely related to each other. The core of BBP is finding a pinpoint score to act as a fair moderator and analyzing user and item biases through the pinpoint score. The pinpoint score is expressed as s, and the model is learned, as shown in Algorithm 1 using Equation (21). Equation (14) shows the rating prediction of BBP.
P refers to the predicted score, wt to the weight type, u to the target user, i to the target item, s to the pinpoint score, and b to the bias. P_wt(s, u, i) is the predicted score for item i of user u computed through s and wt, b_wt(s, u) is the bias score of user u computed through s and wt, and b_wt(s, i) is the bias score of item i computed through s and wt. Three weight types are used: weight based on rating frequency (RF), on amplified RF (ARF), and on logarithmic RF (LRF), as defined in Equations (15)-(20) below.
The difference between the pinpoint score and the average rated score is used as the bias, and the reliability of the average rated score is reflected by using the rating frequency as a weight. RF, ARF, and LRF are the weight types for the rating frequency, indicated by substituting wt in the equations. Equations (15)-(20) show the bias scores using the rating-frequency weights.
RF was used to directly reflect the weight according to the rating frequency. The weight value is computed in a range of 0-1, depending on the rating frequency. The larger the size of the dataset, the higher the accuracy when compared to LRF.
b_RF(s, u) is the bias score of user u computed through s and RF, I(u) is the set of items rated by user u, and r̄(u) is the average of the scores rated by user u. b_RF(s, i) is the bias score of item i computed through s and RF, U(i) is the set of users who rated item i, and r̄(i) is the average of the rated scores of item i.
ARF was used to amplify and reflect the rating-frequency weight of RF. The weight value is computed in a range of −1 to 1. Because RF accuracy affects ARF, ARF has larger accuracy deviations among users than RF: it shows more accurate results for users whose scores it predicts well, and less accurate results for users whose scores it predicts poorly.
LRF was used to reflect the rating-frequency weight of RF in logarithmic form. The core is that it reflects the logarithm of the frequency normalized by the maximum rating frequency, and the weight value is computed in a range of 0-1. The accuracy is generally high, and the smaller the dataset (MovieLens 1M or less), the better the results.
RFMax(User) = max_{v∈User} |I(v)| denotes the maximum rating frequency over all users. Likewise, RFMax(Item) = max_{j∈Item} |U(j)| denotes the maximum rating frequency over all items.
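Since Equations (14)-(20) are not reproduced in this text, the following sketch only illustrates the general shape the surrounding description implies: RF proportional to rating frequency in 0-1, ARF amplified into −1 to 1, LRF log-normalized into 0-1, and a bias that weights the gap between an average rated score and the pinpoint score s. The exact formulas are assumptions:

```python
import math

def rf_weight(freq, max_freq):
    # RF: weight directly proportional to rating frequency, in [0, 1] (assumed form)
    return freq / max_freq

def arf_weight(freq, max_freq):
    # ARF: RF amplified into [-1, 1] (assumed form: 2 * RF - 1)
    return 2.0 * freq / max_freq - 1.0

def lrf_weight(freq, max_freq):
    # LRF: log of the frequency normalized by the maximum frequency, in [0, 1]
    return math.log(1 + freq) / math.log(1 + max_freq)

def bias(s, rated_scores, max_freq, weight=lrf_weight):
    """b_wt(s, .): weighted gap between the mean rated score and pinpoint score s."""
    mean = sum(rated_scores) / len(rated_scores)
    return weight(len(rated_scores), max_freq) * (mean - s)

def bbp_predict(s, user_scores, item_scores, max_user_freq, max_item_freq,
                weight=lrf_weight):
    # Assumed Equation (14) shape: pinpoint score plus user bias plus item bias
    return (s + bias(s, user_scores, max_user_freq, weight)
              + bias(s, item_scores, max_item_freq, weight))
```

For example, a user who rates above s pushes the prediction up, an item rated below s pulls it down, and low rating frequency shrinks either effect toward zero.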
The pinpoint score is computed while using Equation (21) as the objective function.
τ is the set of all rated scores included in the training set, and µ is the average rated score of the training set τ. The core of Equation (21) is inducing the search for a pinpoint score s_x adjacent to µ; Equation (21) adds a penalty score in order to avoid overfitting the pinpoint score to the root mean square error (RMSE).
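The pinpoint search of Algorithm 1 and Equation (21) can be sketched as a simple grid search. The candidate grid, the penalty form, and its coefficient below are assumptions that only preserve the behaviour described above (minimize training RMSE while staying close to µ):

```python
def search_pinpoint(train_rmse, mu, lo=1.0, hi=5.0, step=0.01, penalty=0.1):
    """Grid-search the pinpoint score s (a sketch of Algorithm 1 / Equation (21)).

    `train_rmse(s)` evaluates training error when s is used as the mediator;
    the penalty term pulls the search toward the training mean mu so that s
    is not overfitted to RMSE alone. The penalty form is an assumption.
    """
    best_s, best_obj = None, float("inf")
    s = lo
    while s <= hi + 1e-9:
        obj = train_rmse(s) + penalty * abs(s - mu)  # objective with penalty
        if obj < best_obj:
            best_s, best_obj = s, obj
        s += step
    return best_s
```

With a convex error curve the selected s lands between the pure-RMSE minimizer and µ, which is the trade-off the penalty is meant to enforce.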

MBA: Multiple Bias Analysis
Users' and movies' attributes affect users' decision-making about movies. However, MBA could not help being limited, because it restricts users' attributes to age, gender, and occupation, and movies' attributes to actors, directors, and genres: MBA is built with only the information provided by the experimental dataset.
When a user selects a movie, variables, such as whether he/she is a fan of the actors or the director, his/her preference for a certain movie genre, empathy according to the user's gender or occupation, etc., are involved in the selection [33]. Because BBP only analyzes user and item biases, it cannot analyze user decision-making regarding multiple biases. The core of MBA is that it expanded BBP to analyze user decision-making regarding multiple biases.

Vanilla Model
From Figure 1, MBA extends BBP by reflecting the generalized feature (GF) and personalized feature (PF), and Equation (22) shows the rating prediction of the Vanilla model. The wt and s used in Equation (22) are the weight type and pinpoint score introduced in Section 2.2. MBA optimizes s using Algorithm 1 in a preprocessing stage before computing GF and PF. GF is a bias (global bias) sympathized with other users, and PF is a bias (local bias) independent of other users. Regarding Jim Carrey, for example, GF refers to the public's tendency toward Jim Carrey, and PF refers to an individual's tendency toward Jim Carrey.
GF represents the tendency of a feature that is sympathized with other users. The GF-Score was computed, as shown in Equation (23), and it was designed to reflect the distance ω between the feature bias b(s, f ) and user u computed from all users.
UF(u) is the feature set for user u's age, gender, and occupation, and IF(i) is the feature set for the actors, director, and genres of item i. Here, because some elements of IF(i) are themselves sets (actor, genre), the design adds the average over the target feature TF. R(f) is the set of rated scores belonging to feature f, r(v, j) is the rated score of user v for item j, ω(u, f) is the weight of feature f for user u, and ω(i, f) is the weight of feature f for item i. The range of the feature weight ω is −1 to 1, and the value computed through Equation (26) is used in Equations (28) and (29). In GF, ω is optimized to compute how close feature f is to the popular bias.
If user u is a 30-year-old male student, UF(u) = {Male, 30 (Group-3), Student} will be used. The first term of GF can then be expanded as follows:
ω(u, Age(u)) · b(s, Age(u)) + ω(u, Gender(u)) · b(s, Gender(u)) + ω(u, Occupation(u)) · b(s, Occupation(u))
If item i is the movie The Truman Show, IF(i) = {(Jim Carrey, · · · ), Peter Weir, (Comedy, · · · )} will be used, and the second term of GF can be expanded analogously over IF(i). When feature f is Jim Carrey, the b(s, f) used in the GF-Score is the bias (global bias) of Jim Carrey over all users, and ω(u, f) is the distance between user u and b(s, f) (or its importance).
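The GF-Score expansion above can be sketched as follows. The data layout (dictionaries keyed by user or item and feature, with feature biases b(s, f) precomputed) and the way set-valued item features are averaged are assumptions for illustration:

```python
def gf_score(u, i, omega_u, omega_i, bias, UF, IF):
    """GF-Score sketch: weighted sum over user features plus item features,
    where set-valued item features (actors, genres) are averaged.

    `bias[f]` stands for the precomputed feature bias b(s, f); the exact
    Equation (23) is not reproduced in the text, so this shape is assumed."""
    # First term: user features (age, gender, occupation)
    user_term = sum(omega_u[(u, f)] * bias[f] for f in UF[u])
    # Second term: item features; tuples/lists/sets of features are averaged
    item_term = 0.0
    for tf in IF[i]:
        feats = tf if isinstance(tf, (set, tuple, list)) else [tf]
        item_term += sum(omega_i[(i, f)] * bias[f] for f in feats) / len(feats)
    return user_term + item_term
```

With all weights set to 1, the score simply accumulates the global biases of the user's and the movie's features, which matches the expansion written out above.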
PF represents a personalized tendency. The PF-Score is computed as shown in Equation (24); it reflects the feature bias b(s, f) computed only from the user's own ratings and the reliability ω of that bias. Here, because PF is computed only from the user's own ratings, every user-feature term of UF(u) reduces to r̄(u) − s. Therefore, PF is computed using only IF(i).
The range of ω values used in PF is −1 to 1, and the values computed through Equation (27) are used for the learning in Equation (29). In PF, ω is optimized to compute how certain the user's own feature bias is. Because PF is computed using only the user's own ratings, as shown in Equation (24), it captures individual subjectivity. However, because its rating frequency is below that of GF, which is computed from the ratings of other users, PF is less reliable than GF. Therefore, if the value of ω(u, f) of PF is close to 1, the feature bias can represent the user; otherwise, the feature bias is unreliable.
MBA learns by reflecting error values in ω for an arbitrary number of iterations (Iter), learning over the whole training set τ and computing the error using Equation (25).
The r(u, i) used in Equation (25) is an element of τ, and P^VM_wt(s, u, i) is the predicted value. Equation (25) is designed to decrease the ω of MBA when the error ε(s, u, i) is below 0 and increase it when ε(s, u, i) exceeds 0. ε(s, u, i) is used in Equations (26) and (27), where ε(s, u, i, GF) and ε(s, u, i, PF) are computed by reflecting the ratios of GF and PF.
The λ used in Equations (26) and (27) is the learning rate, and |GF(s, u, i)| and |PF(s, u, i)| are the absolute values of the scores. The value computed with Equation (26) trains the ω used in GF through Equations (28) and (29), and the value computed with Equation (27) trains the ω used in PF through Equation (29). Because PF is computed using only IF(i), only Equation (29) is used to learn its ω. The TF(f) used in Equations (28) and (29) denotes the parent feature to which feature f belongs; if feature f is Action, TF(f) is Genre. MBA learns to optimize ω through Equations (25)-(29) and thereafter predicts rated scores using Equation (22).
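The learning loop above can be sketched as follows. The exact update rule of Equations (25)-(29) is not reproduced in this text, so only the described behaviour is kept: the prediction error is split between GF and PF in proportion to their absolute score contributions, and the corresponding feature weights ω are nudged in the direction of the error. All function and parameter names are assumptions:

```python
def mba_update(train, predict, gf_feats, pf_feats, gf_score, pf_score,
               omega_gf, omega_pf, lam=0.01, iters=10):
    """Sketch of the MBA weight-learning loop (Equations (25)-(29), assumed form)."""
    for _ in range(iters):
        for u, i, r in train:
            eps = r - predict(u, i)                     # error, Equation (25)
            g, p = abs(gf_score(u, i)), abs(pf_score(u, i))
            total = (g + p) or 1.0                      # guard against 0 / 0
            for f in gf_feats(u, i):                    # GF weights (Eqs. (26), (28)-(29))
                omega_gf[f] += lam * eps * g / total
            for f in pf_feats(u, i):                    # PF weights (Eqs. (27), (29))
                omega_pf[f] += lam * eps * p / total
    return omega_gf, omega_pf
```

When the prediction is too low the weights of the involved features grow, and when it is too high they shrink, matching the sign behaviour described for Equation (25).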

Heuristics Approach
In MBA, the ω used in GF and PF showed deviations in accuracy depending on the initial values. The heuristics approach addresses the feature weight ω used in GF and PF to enhance MBA accuracy.
From Figure 2, the ω to which the heuristics approach is applied is divided into the feature weight fw used in MBA and the weight type bwt used in BBP.

• Case A: feature weight fw is initialized by reflecting the concept of the weight type of BBP introduced in Section 2.2.
Case A initializes the feature weight ω used in GF and PF by reflecting the concept of the weight type of BBP and observes MBA accuracy. For fw(RF), the initial value of the ω(u, f) of GF is computed from R(f), the set of rated scores that belong to feature f in the training set τ. Because BBP has three weight types in total (RF, ARF, and LRF), the experimental results for Case A cover 9 cases (9 = 3 × 3 = |bwt| × |fw|).

• Case B: feature weight fw is initialized by unifying it into an arbitrary value.
In Case B, the feature weight ω(u, f) used for GF and PF was varied in steps of 0.1 from −1 to 1 to observe changes in accuracy. The experimental results for Case B cover 63 cases (63 = 3 × 21 = |bwt| × |fw|).

Hybrid Model
The Vanilla model is combined with MF to form a hybrid model that enhances prediction accuracy by complementing the shortcomings of the two models. The hybrid model is built by preprocessing MF (Section 2.1) and the Vanilla model (Section 3.1) to optimize each separately, then combining the two models and optimizing the combined model. Equation (30) is the rating prediction of the hybrid model, Equation (31) is its objective function, and Equations (32) and (33) describe its learning. Equation (32) reduces the bias score used for prediction in Equation (33) if the predicted value exceeds the actual value, and increases it if the actual value exceeds the predicted value. The λ in Equation (32) is the learning rate, and the f in Equation (33) is the attribute regarding u and i.
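Since Equations (30)-(33) are not reproduced in this text, the following sketch shows only the general idea of the hybrid stage: both predictors are pretrained separately, and a combination parameter is then fine-tuned by the sign of the prediction error, as described for Equation (32). The convex-blend form and the parameter alpha are assumptions, not the paper's actual combination rule:

```python
def train_hybrid(train, p_mf, p_vm, lam=0.1, iters=20, alpha=0.5):
    """Fine-tune a blend of two pretrained predictors (assumed hybrid form).

    `p_mf(u, i)` and `p_vm(u, i)` are the pretrained MF and Vanilla-model
    predictors; alpha is moved toward whichever one reduces the error."""
    for u, i, r in train * iters:
        pred = alpha * p_mf(u, i) + (1.0 - alpha) * p_vm(u, i)
        err = r - pred                               # over/under-prediction sign
        alpha += lam * err * (p_mf(u, i) - p_vm(u, i))
        alpha = min(1.0, max(0.0, alpha))            # keep the blend convex
    return alpha

def hybrid_predict(alpha, p_mf, p_vm, u, i):
    return alpha * p_mf(u, i) + (1.0 - alpha) * p_vm(u, i)
```

If one predictor is consistently closer to the truth, alpha drifts toward it; with comparable predictors the blend settles in between, which is the complementary effect the hybrid model aims at.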

Dataset
The MovieLens 100K dataset [35] was used as the rating data to evaluate accuracy. However, since the MovieLens dataset contains no information on actors and directors, as shown in Table 1, this information was supplemented using the HetRec2011 dataset [36]. In creating the experimental dataset, items that could not be referenced from HetRec2011 were removed: a total of 89 items and the 3792 ratings that belonged to them. The experimental dataset used in our experiment is shown in Table 2. The ratings were sorted by each user's timestamps and then composed into five-fold cross-validation sets. For performance comparison, the average of the evaluation results computed from the five sets (five-fold cross-validation) was used. For the age groups of the users in Table 2, the users are divided into seven age groups, as in MovieLens 1M.
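The timestamp-aligned five-fold construction can be sketched as follows. The per-user round-robin assignment is an assumption about how the time-sorted ratings were distributed across folds:

```python
def chrono_five_fold(ratings_by_user, k=5):
    """Sort each user's ratings by timestamp, then deal them into k folds.

    `ratings_by_user` maps user -> list of (timestamp, item, rating) tuples;
    returns k folds of (user, item, rating) triples."""
    folds = [[] for _ in range(k)]
    for user, rows in ratings_by_user.items():
        for idx, (ts, item, rating) in enumerate(sorted(rows)):
            folds[idx % k].append((user, item, rating))
    return folds
```

Dealing per user keeps every user represented in every fold, so no fold suffers a pure cold-start for known users.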

Evaluation Metrics
For the evaluation metric, the root mean square error (RMSE) was used, as shown in Equation (34); smaller values mean higher accuracy. In Equation (34), P_wt refers to the score predicted by the model, and r refers to the rated score in the test set.
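Equation (34) is the standard RMSE; a direct implementation:

```python
import math

def rmse(predicted, actual):
    """Equation (34): root mean square error over a test set; lower is better."""
    assert len(predicted) == len(actual) and actual
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, actual))
                     / len(actual))
```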

Results
MBA proposed the Vanilla model (Section 3.1) to improve BBP accuracy, tried the heuristics approach (Section 3.2) to improve the Vanilla model's performance, and attempted to supplement the shortcomings of existing models through the hybrid model (Section 3.3). Therefore, the MBA experimental results are analyzed separately: the Vanilla model in Section 4.3.1, the heuristics approach in Section 4.3.2, and the hybrid model in Section 4.3.3; all experimental results are then integrated in Section 4.3.4 to compare accuracy.

Vanilla Model
MBA reconstructed the dataset to use metadata (Section 4.1). In this process, an experiment was conducted on the existing CF when considering changes in the environment of the dataset. Table 3 compares the previous study and the Vanilla model accuracies, and the optimized settings are also specified. RMSE results for different values of n and of Iter are shown in Figures A1 and A2 of Appendix A.
In the performance comparison, the existing method is denoted BBP(wt) and the proposed method MBA(wt), where wt is the weight type (Section 2.2). For instance, BBP(RF) is a model using Equations (15) and (16), and MBA(LRF) is a model using Equations (19) and (20). n is the value used to search the reference point, Iter is the number of iterations for learning ω, and λ is the learning rate used in Equations (26) and (27). Table 3 presents the RMSE comparison of the methods used in the experiment; the five-fold cross-validation RMSE of every method can be viewed in more detail in Table A1. In RF, the smaller the data size, the lower the pinpoint score optimization performance, because MBA computes multiple biases through pinpoint scores, inducing more errors. However, MBA(ARF) and MBA(LRF) outperformed UBCF and IBCF, and MBA(LRF) showed the best results. These results show that Vanilla model performance is determined by the pinpoint score. Because RF performance is good when the dataset exceeds MovieLens 10M [32], RF should be reevaluated through experiments on a large dataset. Additionally, in analyzing the Vanilla model, variations in performance were observed depending on the initial value of ω (GF, PF). This problem occurred in all Vanilla models, and the heuristics approach was conducted, as follows, to alleviate it.

Heuristics Approach
To improve MBA performance, experiments were conducted using the methods presented in Section 3.2. Figure 3 shows the experimental results for Case A and Figure 4 those for Case B. bwt(RF) means that the weight type wt of BBP is RF, and fw(RF) means that the feature weight fw was initialized as in RF. Case A performance is best when bwt(LRF) is combined with fw(RF) (Figure 3), and the performance variation of the bwt(LRF) series is the smallest. In Case B, the RMSE of bwt(LRF) was the best (Figure 4), and the RMSE converged when fw was 0.7 or higher. Additionally, Cases A and B both showed better RMSE than the Vanilla model. Based on the foregoing, accuracy may be further enhanced by improving the Vanilla model's learning policy. Table 4 summarizes the heuristics approach results (-H) and compares their RMSE with the Vanilla model. From Table 4, it is seen that RMSE is improved through the heuristics approach. Additionally, considering the parameters n and Iter used for learning, the cost of learning is lower.

Hybrid Model
MF and SVD++ also showed excellent performance on the dataset used in the experiment. MF and SVD++ were tested using MyMediaLite [34] (available at http://www.mymedialite.net/, accessed on 20 March 2021), and their RMSE is compared with the hybrid model in Table 5. The hybrid model outperformed the results in Tables 3 and 4, and HybridSVD++(LRF) in Table 5 showed the best performance. Both MF and SVD++ showed improved RMSE when combined with MBA. Figure 5 integrates, summarizes, and compares the RMSE results of Sections 4.3.1-4.3.3. In Figure 5, the x-axis represents the weight type, and the y-axis represents the RMSE. UBCF, IBCF, MF, and SVD++, which are unrelated to the weight type wt, are drawn as line graphs, and BBP and MBA, which are affected by wt, are drawn as bar graphs. Figure 5 shows an overall improvement in MBA performance when the weight type is set to LRF. Comparing RMSE based on LRF, the Vanilla model surpasses BBP, the heuristics approach surpasses the Vanilla model, HybridMF outperforms MF, and HybridSVD++ outperforms SVD++. SVD++ is regarded as the best among model-based CFs owing to long and rigorous verification by related researchers. Because MBA's HybridSVD++(LRF) outperformed SVD++, there was a previously unconsidered preference pattern in the bias, and MBA's analysis of it explains why its accuracy exceeded that of SVD++.

Conclusions
The performance limit of recommendation systems stems from two problems: the analysis algorithm and the data. The analysis algorithm problem asks, "What algorithm fits the given data?" If an arbitrary dataset χ showed 91% accuracy when classified by a support vector machine (SVM) and 85% when classified by k-nearest neighbors (k-NN), SVM can be said to be better. Here, SVM and k-NN become the analysis viewpoints, and a hybrid model that adds viewpoints can be used to improve performance. The data problem is a missing-value problem. Because a rating is the result of decision-making shaped by a user's subjectivity, preference, situation, and environment, consistency cannot be guaranteed. Moreover, rating data are uncertain answer sheets, because information about the process by which the user makes the decision is lacking. Therefore, the accuracy of recommendation systems has converged at the current level, because a limit to understanding decision-making exists.
Considering the process by which a user selects a movie, actors, directors, and genres can be important variables. If a user likes a certain genre, he/she is highly likely to prefer an actor closely associated with that genre or to be a fan of that actor. For example, Arnold Schwarzenegger, Sylvester Stallone, and Keanu Reeves belong to the action genre. Furthermore, for a director represented by a certain genre, the actors related to that director can be considered. MBA's research motive was to observe this pattern through the rating frequency and the tendency of ratings, and this was defined as bias analysis. MBA is a model to analyze the foregoing; through experiments, the accuracy of BBP, which is the basis of MBA, was improved, and the results in Table A2 showed that the hybrid model outperforms MF and SVD++. Based on these results, it can be argued that bias is reflected in users' decision-making for movies.
MBA was designed to learn the feature weight ω used in a user's GF and PF. Observing the ω values used for GF and PF, it was identified that they can serve as a user's unique characteristic. Because HybridSVD++(LRF) outperformed SVD++, which lacks such features, the ω values of GF and PF can be a clue to explaining a user's decision-making process. Furthermore, using the ω values of GF and PF, it is possible to design a recommendation system that gives the reason for a recommendation together with the recommendation itself. For example, when MBA has recommended the movie "Iron Man", how crucial "Actor-Robert Downey Jr." and "Genre-SF" were can be computed numerically. Reflecting this concept, a data description that explains the reason for a recommendation can accompany the recommendation.
Thus far, bias has been treated at the level of graph-axis movement, as in y = ax + b, or as a negative variable. Although using bias requires more observation and verification, given the influence of bias on decision-making in music, food, movies, books, news, etc., bias is a more important element than graph-axis movement. Additionally, studies on fairness models, noting potential biases in data and analysis models, have recently been emerging, and the importance of fairness-related studies is increasing. Fairness is a critical issue in bias analysis. While checking the limitations of the Vanilla model through the heuristics approach, MBA showed results with small variations in accuracy when a pinpoint score appropriate as a mediator was selected. Synthesizing these results into future research plans, MBA accuracy is expected to improve through studies on fairness and by enhancing the Vanilla model's learning method.

Conflicts of Interest:
The authors declare no conflict of interest.

Table A1. For each method, five-fold cross-validation was performed (folds 1-5); the RMSE values for each fold, together with their averages and variances, are shown.
