Augmenting Black Sheep Neighbour Importance for Enhancing Rating Prediction Accuracy in Collaborative Filtering

Abstract: In this work, an algorithm for enhancing the rating prediction accuracy in collaborative filtering, which does not need any supplementary information, utilising only the users’ ratings on items, is presented. This accuracy enhancement is achieved by augmenting the importance of the opinions of ‘black sheep near neighbours’, which are pairs of near neighbours with opinion agreement on items that deviates from the dominant community opinion on the same item. The presented work substantiates that the weights of near neighbours can be adjusted, based on the degree to which the target user and the near neighbour deviate from the dominant ratings for each item. This concept can be utilised in various other CF algorithms. The experimental evaluation was conducted on six datasets broadly used in CF research, using two user similarity metrics and two rating prediction error metrics. The results show that the proposed technique increases rating prediction accuracy both when used independently and when combined with other CF algorithms. The proposed algorithm is designed to work without requiring any supplementary sources of information, such as user relations in social networks and detailed item descriptions. These findings indicate both the efficacy and the applicability of the proposed work.


Introduction
Collaborative filtering (CF) is a dominant recommender systems technique that considers users' likes and tastes, expressed as item ratings, to create personalised recommendations [1,2]. It aggregates the likes of users having similar tastes/preferences, termed 'near neighbours', to compute rating predictions for items, which then lead to recommendations. Since rating prediction accuracy is directly related to recommendation usefulness and reliability, a major challenge that collaborative filtering systems confront is the enhancement of rating prediction accuracy.
The two main categories of CF algorithms are the memory-based and the model-based ones [3,4]. The algorithms that belong to the first category exploit user rating data to compute the similarity between users (or items). The ones that belong to the second category develop models using various techniques (derived from data mining, machine learning, etc.) [5][6][7]. Furthermore, there are many hybrid approaches, combining the features of the aforementioned two main categories.
Each of the three approaches has its own advantages: (a) the memory-based approach is characterized by ease of creation and use, high explainability of results, easy incorporation of new data, content-independence of the items being recommended and good scaling with co-rated items; (b) the model-based approach is characterized by [...] matrix sparsity [...] is a simple and straightforward procedure: it modifies the weight/importance of each NN, based on the aforementioned concept, i.e., based on the relative (un-)acceptance of their commonly rated items, in combination with their ratings on these items. The rationale behind the usage of this concept derives from the fact that when, in real life, a person likes a product (e.g., a TV series, a car model, a videogame) that the majority of others do not (and hence this product obtains a relatively low average rating value), and this person finds another person who also likes the exact same product, the probability that the former will value the latter's opinion more than the opinions of other users, for a future recommendation, is relatively high. This rationale is in line with the use of the inverse document frequency (IDF) metric in information retrieval, where terms occurring less frequently in the document corpus are assigned higher weights [34].
To validate our approach, an extensive evaluation is presented, using (i) two user similarity metrics, (ii) two rating prediction error metrics, and (iii) six datasets that are widely used in CF research.
It is worth mentioning that the proposed approach (i) does not need any kind of supplementary information, apart from users' ratings on items, and hence can be applied to any CF dataset and (ii) can be fused with other CF approaches that aim to enhance rating prediction accuracy or efficiency, whether these use supplementary sources of information, such as users' relations in social networks and detailed characteristics of items [35][36][37], or not [38][39][40].
The rest of the paper is structured as follows: in Section 2 the related work is overviewed, while in Section 3 the proposed algorithm is introduced. In Section 4 the methodology for tuning the algorithm's operation is reported and the presented algorithm is evaluated. Finally, Section 5 concludes the paper and outlines future work.

Related Work
The accuracy of CF-based recommender systems is a research field that has attracted numerous research works over the last years, which are divided into two main categories. The first category includes research works which exploit supplementary sources of information, such as user relations in social networks and item characteristics, while the second category includes research works that are based solely on the information contained in the user-item rating matrix.
In regard to the first category, [41] examines the impact of incorporating social ties in the prediction formulation, targeting prediction accuracy, and presents a social network CF algorithm which tunes the contribution of the social information by using a learning method as a weight parameter in the proposed similarity measure. The work in [42] extracts information from distant social relations and captures opinions from users while modelling their interactions with items, introducing a deep social CF algorithm that exploits social network information for recommendation production. Ref. [43] proposes a method that overcomes the cold start problem and data sparsity in CF, by designing a Matrix Factorization (MF) Linked Open Data model, which uses a knowledge base to find information concerning new entities. Ref. [44] states that both group affiliation information and social network information may significantly enhance the accuracy of popularity-based voting recommender systems, and introduces a set of NN-based and MF-based recommender systems for online social voting. The work in [45] combines tie strength with social network information to create a local random walk-based friend recommendation method. Initially, the basis for friend recommendation is constructed by using a weighted friend network, and then this network is used to compute user similarity by a local random walk-based similarity measure. Ref. [46] first introduces a sparsity alleviation approach, based on implicit and explicit satisfaction, and uses objective and subjective trust to establish enhanced trust relationships among users. Then, for each target user, it selects the user's trusted neighbours, which are screened using emotional consistency. Finally, it predicts item ratings to obtain the final recommendation lists. Ref. [36] combines a time decay factor for rated items, cognition relationships between users, and personal cognition behaviour into a unified probabilistic MF model and presents a social MF method for personalised recommendation using social interaction factors.
Although all the previous works achieve relatively high rating prediction accuracy improvement, the supplementary source of information required may not always be available. As a result, an algorithm which can work using only the information located in the user-item-rating matrix may prove to be more appropriate, since it can be applied to every CF dataset.
To this extent, ref. [47] introduces an approach that realises an item-variance weighting in item-based CF. More specifically, it applies a time-related correlation degree to form time-aware similarity computation, which estimates the relationship between two items and reduces the importance of an item that has not recently been rated. Ref. [48] presents a CF optimization method that initially incorporates multiple interests to optimize neighbour selection and then it utilises a ranking strategy that rearranges both the top-N item list and the area the threshold controls, maximising the popularity while maintaining a relatively low prediction accuracy reduction. Ref. [49] clusters items and users by using a Gaussian mixture model and builds a new interaction matrix by extracting new item features that manages to solve the impact of rating data sparsity on CF algorithms. Furthermore, by combining the Jaccard and triangle similarities, it proposes a new similarity calculation algorithm. Ref. [50] presents a slope one algorithm based on user similarity and trusted data fusion that can be applied in various CF systems. For the creation of the final recommendation formula, the proposed algorithm includes the procedures of trusted data selection, user similarity calculation and inclusion of this similarity to the weight factor of the improved slope one algorithm. Ref. [51] presents a local similarity algorithm that can use multiple correlation structures between CF users. Firstly, it uses a clustering method to discover groups of similar items and then, for each cluster, it creates a user-based similarity model, namely Cluster-based Local Similarity. Ref. [52] introduces a CF algorithm that exploits repetitive purchased products and symmetric purchasing order, to tackle user big data. The presented algorithm combines a word2vec mechanism with a gradient boosting machine learning architecture to explore the purchased products based on users' click patterns. 
In [53], a product recommendation method for CF based on the triangle similarity is presented. The similarity metric considers the ratings of both the commonly rated items and the non-commonly rated items from pairs of users. It is further complemented with the users' rating preference behaviour. In [54], a CF rating prediction algorithm is introduced that modulates the rating prediction numeric value, based on the relation between the period the rating to be predicted belongs to, in a certain product category, and the users' experienced wait period in the same product category, aiming to enhance the prediction accuracy of CF systems.
Still, none of the aforementioned works considers users who share a positive or negative opinion about an item, but are outliers when compared to the majority of the users in the dataset. The present work fills this gap by introducing an algorithm that modifies the weight/importance of each NN to the prediction value, based on the relative (un-)acceptance of their commonly rated items, in combination with their ratings on these items, and by assessing its performance both when used independently and when combined with another CF algorithm that also aims at enhancing rating prediction accuracy.

The Proposed Algorithm
The procedure that a CF algorithm typically follows when predicting a rating for user U includes three main steps:
1. Find users having tastes close/similar to those of U, by examining the similarity of the ratings already submitted to the rating database (rDB), in order to identify U's near neighbour (NN) users; these users will operate as recommenders to U. Typically, in CF systems, the metrics used to quantify user similarity are the Pearson correlation coefficient (PCC) and the Cosine Similarity (CS) [55,56], which are expressed as shown in Equations (1) and (2), respectively. Generally, for a user V to be considered as U's NN, their quantified rating similarity value has to exceed a specific threshold, e.g., the value 0.0 for the PCC metric [55].
2. Predict the rating value that U would give to an item i; in order to compute the rating prediction $p_{U,i}$, the standard CF rating prediction formula [26,57] is typically applied (Equation (3)). The weight/importance of each NN to the prediction value is based only on its numeric similarity with U, calculated during the previous step.
3. Recommend to U the items having the highest prediction values; the number of recommended items is determined by the administrator of the recommender system [58,59].
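For reference, the textbook forms usually meant by the PCC, the CS and the standard CF prediction formula can be written as below. This is a sketch that assumes the usual definitions apply here; the symbols $I_U \cap I_V$ (items rated by both users), $\bar{r}_U$ (mean rating of user U), $NN_U$ (the NNs of U that have rated item i) and $sim(U,V)$ (either the PCC or the CS value) are notation introduced for this illustration.

```latex
\begin{align}
PCC(U,V) &= \frac{\sum_{k \in I_U \cap I_V} (r_{U,k} - \bar{r}_U)\,(r_{V,k} - \bar{r}_V)}
                 {\sqrt{\sum_{k \in I_U \cap I_V} (r_{U,k} - \bar{r}_U)^2}\;
                  \sqrt{\sum_{k \in I_U \cap I_V} (r_{V,k} - \bar{r}_V)^2}} \tag{1} \\[4pt]
CS(U,V)  &= \frac{\sum_{k \in I_U \cap I_V} r_{U,k}\, r_{V,k}}
                 {\sqrt{\sum_{k \in I_U \cap I_V} r_{U,k}^2}\;
                  \sqrt{\sum_{k \in I_U \cap I_V} r_{V,k}^2}} \tag{2} \\[4pt]
p_{U,i}  &= \bar{r}_U + \frac{\sum_{V \in NN_U} sim(U,V)\,(r_{V,i} - \bar{r}_V)}
                             {\sum_{V \in NN_U} \lvert sim(U,V) \rvert} \tag{3}
\end{align}
```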
The proposed algorithm aims at augmenting the importance of each NN V when V and U (a) agree in their opinion on some items and (b) hold an opinion on these items that deviates from that of the majority of users; to this end, the proposed algorithm adjusts the rating prediction formula given in the second step above (Equation (3)).
More specifically, the proposed algorithm modifies Equation (3) by considering a black sheep factor bsf(U, V) between user U, for whom the rating prediction is computed, and each of U's NNs, V, as shown in Equation (4). Effectively, the bsf(U, V) factor is an adjustment applied to each NN's contribution to the prediction computation, based on the degree to which users U and V mutually agree on the rating of items, while at the same time disagreeing with the majority of other users on the same items.
For the application of this algorithm, the bsf(U,V) factor needs to be determined; the setting of the bsf factor to its optimal value is experimentally explored in the following section, along with the prediction accuracy gains of the proposed algorithm.
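One plausible reading of the adjusted prediction formula is sketched below in Python. It assumes that bsf(U, V) multiplies the similarity weight of each NN in both the numerator and the normalising denominator of the standard formula; the exact placement of the factor is an assumption, and the `Neighbour` type and `predict` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Neighbour:
    sim: float     # similarity between the target user U and this NN (PCC or CS)
    bsf: float     # black sheep factor bsf(U, V) for this pair
    rating: float  # the NN's rating on the target item
    mean: float    # the NN's mean rating


def predict(mean_u: float, neighbours: Iterable[Neighbour]) -> float:
    """Mean-centred CF prediction with each NN weighted by sim * bsf (assumed form)."""
    neighbours = list(neighbours)
    numerator = sum(n.sim * n.bsf * (n.rating - n.mean) for n in neighbours)
    denominator = sum(abs(n.sim * n.bsf) for n in neighbours)
    if denominator == 0:
        # No usable neighbours: fall back to the target user's mean rating.
        return mean_u
    return mean_u + numerator / denominator
```

Setting every `bsf` to 1.0 recovers the plain CF prediction, which makes it easy to compare the two variants on the same neighbourhood.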

Algorithm Tuning and Experimental Evaluation
In this section, we report on our experiments aiming to:
1. Determine the optimal value of the bsf factor, to tune the proposed algorithm; and
2. Evaluate the accuracy of the rating prediction of the proposed algorithm, both when used independently and when combined with a state-of-the-art CF algorithm also aiming at rating prediction accuracy improvement.
For the evaluation of the rating prediction quality, both the RMSE and MAE error metrics have been employed. They were quantified using the standard "hide one" technique, where one rating from each user in the database is hidden and an attempt is made to predict its value [60][61][62]. In our work, this experiment was executed twice: the first time, a random rating was hidden for each user, while the second time, each user's last rating was hidden (considering the ratings' timestamps in the rDB). These two experiments produced very close results (less than 1% difference observed); hence, we report on the results from the first experiment, for conciseness. The practice described above is the typical one when evaluating a rating prediction CF algorithm [31,63,64].
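The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the nested-dictionary rating layout and the `predict` and `pick` callbacks are assumptions, with `pick` standing in for either the random or the last-rating selection.

```python
import math
from typing import Callable, Dict, List, Optional

Ratings = Dict[str, Dict[str, float]]  # user -> item -> rating (assumed layout)


def mae(errors: List[float]) -> float:
    """Mean absolute error over a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors: List[float]) -> float:
    """Root mean squared error over a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def hide_one_errors(
    ratings: Ratings,
    predict: Callable[[Ratings, str, str], Optional[float]],
    pick: Callable[[Dict[str, float]], str],
) -> List[float]:
    """Hide one rating per user, predict it, and collect (prediction - truth)."""
    errors = []
    for user, items in ratings.items():
        if len(items) < 2:
            continue  # nothing would remain to describe this user
        hidden = pick(items)
        # Shallow copy; only the current user's ratings are rebuilt.
        visible = dict(ratings)
        visible[user] = {i: r for i, r in items.items() if i != hidden}
        prediction = predict(visible, user, hidden)
        if prediction is not None:  # skip cases where no prediction is possible
            errors.append(prediction - items[hidden])
    return errors
```

Passing a `pick` that selects a random item gives the first experiment, while one that selects the item with the latest timestamp would give the second; timestamps are omitted from the assumed layout for brevity.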
Our experiments were executed on six datasets; four of these were obtained from Amazon [65,66], the fifth was sourced from MovieLens [67,68], while the last was sourced from NetFlix [69]. Regarding the four Amazon datasets, we used the 5-core ones, in which each user and each item has at least 5 ratings. Unlike the simple Amazon datasets, where for some users and items only one rating exists in the rating database, this ensures that at least 4 other ratings exist and hence that the application of any CF algorithm can produce valid results. The four Amazon datasets are considered relatively sparse (their density is less than 0.1%), while the MovieLens and the NetFlix ones are considered relatively dense (their density is greater than 1%). We opted to use both sparse and dense datasets in order to confirm the applicability of the presented algorithm to every CF dataset, regardless of its density. Table 1 is a synopsis of the datasets utilised in this work. The aforementioned datasets are widely used in CF research [70][71][72][73] and they contain each rating's timestamp (essential information for hiding each user's last rating), while at the same time they vary in their item domain category (music, videogames and TV series, books and movies).
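Density here is taken to mean the share of filled cells in the user-item rating matrix, which is the usual definition and is assumed to be the one intended. The short sketch below shows the computation; the counts in the usage note are made-up round numbers, not figures from Table 1.

```python
def density(num_ratings: int, num_users: int, num_items: int) -> float:
    """Fraction of the user-item matrix that holds a rating."""
    return num_ratings / (num_users * num_items)


def is_sparse(num_ratings: int, num_users: int, num_items: int) -> bool:
    """Sparse in the sense used in the text: density below 0.1%."""
    return density(num_ratings, num_users, num_items) < 0.001
```

For instance, a hypothetical dataset with 50,000 ratings from 10,000 users on 10,000 items has a density of 0.05% and would count as sparse, whereas 100,000 ratings from 1,000 users on 2,000 items give a density of 5%.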

Determining the Algorithm Parameters
The goal of the first experiment is to determine the optimal value for the bsf factor used in the rating prediction formula. To do so, we examined more than 40 candidate settings; however, we report only on the most indicative ones, for conciseness. More specifically, Figure 1 illustrates the average prediction accuracy improvement under different bsf factor settings, pertaining to the MAE and the RMSE scores, where each setting_i corresponds to a different setting for the computation of the bsf factor.
In the equations defining these settings, the following notations are used:
• low_thr denotes the value below which a rating is considered to be negative; formally, is_negative($r_{U,i}$) ⇔ $r_{U,i}$ ≤ low_thr
• high_thr, correspondingly, represents the value above which a rating is considered to be positive; formally, is_positive($r_{U,i}$) ⇔ $r_{U,i}$ ≥ high_thr
• blackSheepRatings(U,V) is the number of ratings where users U and V both have a positive (or negative) rating, while the user community has a negative (or positive, respectively) rating on the same item. Here, UC denotes the user community, i.e., the set of users in the dataset
• numCommonlyRated(U,V) is the number of items that have been rated by both U and V; formally, numCommonlyRated(U,V) = |{i ∈ I : $r_{U,i}$ ≠ NULL ∧ $r_{V,i}$ ≠ NULL}|
In Figure 1 we can notice that the optimal setting is the one where the black sheep factor equals 1.2 when at least 5% of the items commonly rated by two users U and V are black sheep ratings, and 0.9 otherwise, with the low threshold (below which a rating is considered relatively negative) equal to 2.5/5 and the high threshold (above which a rating is considered relatively positive) equal to 3.5/5. This setting achieves the largest rating prediction gains for both error quantification metrics.
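The optimal setting described above can be sketched in Python as follows. Two points are assumptions rather than details confirmed by the text: the community opinion on an item is represented by the item's average rating, and that average is classified with the same low and high thresholds as individual ratings. The function names mirror the notation but are otherwise hypothetical.

```python
from typing import Dict

LOW_THR = 2.5     # ratings at or below this are treated as negative (scale of 5)
HIGH_THR = 3.5    # ratings at or above this are treated as positive
MIN_SHARE = 0.05  # at least 5% of the common ratings must be black sheep ratings
BSF_HIGH = 1.2    # factor for black sheep NN pairs
BSF_LOW = 0.9     # factor for all other NN pairs


def is_negative(rating: float) -> bool:
    return rating <= LOW_THR


def is_positive(rating: float) -> bool:
    return rating >= HIGH_THR


def num_commonly_rated(ru: Dict[str, float], rv: Dict[str, float]) -> int:
    """Number of items rated by both users."""
    return len(ru.keys() & rv.keys())


def black_sheep_ratings(ru: Dict[str, float], rv: Dict[str, float],
                        item_avg: Dict[str, float]) -> int:
    """Count common items where U and V agree with each other but oppose the community.

    Assumption: the community opinion is the item's average rating.
    """
    count = 0
    for item in ru.keys() & rv.keys():
        both_pos = is_positive(ru[item]) and is_positive(rv[item])
        both_neg = is_negative(ru[item]) and is_negative(rv[item])
        avg = item_avg[item]
        if (both_pos and is_negative(avg)) or (both_neg and is_positive(avg)):
            count += 1
    return count


def bsf(ru: Dict[str, float], rv: Dict[str, float],
        item_avg: Dict[str, float]) -> float:
    """Black sheep factor for a pair of users under the reported optimal setting."""
    common = num_commonly_rated(ru, rv)
    if common == 0:
        return BSF_LOW  # no common items, so no evidence of black sheep agreement
    share = black_sheep_ratings(ru, rv, item_avg) / common
    return BSF_HIGH if share >= MIN_SHARE else BSF_LOW
```

Keeping the thresholds as module-level constants makes it straightforward to sweep over alternative settings, in the spirit of the tuning experiment.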
The corresponding experiment using the CS user similarity metric produced similar results, with the optimal setting achieving a rating prediction error reduction of 2% for both the MAE and RMSE metrics.

Rating Prediction Accuracy Improvement Achieved by the Proposed Algorithm
Having experimentally determined the optimal setting for the bsf factor, we present our findings regarding the performance gains in terms of rating prediction accuracy, stemming from the application of the proposed algorithm on the datasets used in our evaluation (cf. Table 1). Figure 2 presents the accuracy gains that the proposed algorithm achieves, in terms of the MAE and RMSE metrics, when using the PCC similarity metric and taking the performance of the plain CF algorithm as a yardstick. The proposed algorithm achieves an average prediction MAE reduction of 2.1% and an average prediction RMSE reduction of 2.0%, when using the PCC user similarity metric. Examining each dataset individually, the performance edge of the proposed algorithm against the plain CF algorithm ranges from 1.3% and 1.3% (for the MovieLens 100K dataset) to 2.7% and 2.5% (for the Amazon "Videogames" dataset), for the MAE and the RMSE metrics, respectively. Figure 3 presents the accuracy gains achieved by the proposed algorithm in terms of the MAE and RMSE metrics, when using the CS similarity metric and again taking the performance of the plain CF algorithm as a yardstick.
The proposed algorithm achieves an average prediction MAE reduction of 2% and an average prediction RMSE reduction of 2% as well, when using the CS user similarity metric. At the individual dataset level, the performance edge of the proposed algorithm against the plain CF algorithm ranges from 1.2% and 1.1% (for the MovieLens 100K dataset) to 2.3% and 2.5% (for the Amazon "Videogames" dataset), for the MAE and the RMSE metrics, respectively.

Combining the Proposed Algorithm with a Second Algorithm Targeting Rating Prediction Accuracy Improvement
As stated in the introduction, the proposed algorithm can be easily fused with other CF approaches, aiming to enhance rating prediction accuracy.
The rationale behind the evaluation of the combination of the proposed algorithm with another algorithm is that, currently, many recommender systems have been implemented and use diverse algorithms that aim to achieve increased accuracy. A recommender system administrator may wonder whether the algorithm employed in their system needs to be replaced by the proposed one, and what the resulting benefits would be, or whether the proposed algorithm may be combined with the one already employed and, if so, what the benefits would be. As a result, the following experiment offers useful insight regarding the additional accuracy gains that may be reaped for existing recommender systems, if the proposed algorithm is incorporated to complement any existing algorithm(s).
Towards this direction, the third experiment aims at assessing the rating prediction accuracy improvement achieved when the proposed algorithm is combined with another CF approach targeting rating prediction accuracy. In particular, we report on our experiments where the proposed algorithm is combined with the CFEPC algorithm [54]. The CFEPC algorithm is a state-of-the-art algorithm (published towards the end of 2020) that also aims at improving the CF rating prediction accuracy and does not need any additional information on the items or the users (e.g., user social relationships or item categories); hence, it can also be applied to all CF datasets. Figure 4 illustrates the improvement in the MAE achieved by combining the presented algorithm with the CFEPC algorithm, when using the PCC as the similarity metric and again taking the performance of the plain CF algorithm as a yardstick. The combination of the CFEPC algorithm with the proposed algorithm resulted in a relative improvement of 15%, on average, in relation to the gains obtained when using the plain version of the CFEPC (from 6.8% to 7.8%, in absolute figures), considering the MAE error metric. Similarly, the relative improvement considering the RMSE error metric was found to be equal to 19%, on average (from 5.8% to 6.9%, in absolute figures).
The experiment demonstrates that the performance gains of the CFEPC algorithm are further enhanced by approximately 50% of the performance gains achieved when the proposed algorithm is independently applied on sparse datasets (i.e., the Amazon datasets), while for the dense dataset (MovieLens Latest 100K dataset) the performance enhancement of the CFEPC algorithm is approximately equal to 25% of the gains achieved by the proposed algorithm on the same dataset. Figure 5 illustrates the improvement in the MAE achieved by combining the presented algorithm with the CFEPC algorithm, when using the CS as the similarity metric and again taking the performance of the plain CF algorithm as a yardstick. The combination of the CFEPC algorithm with the presented algorithm resulted in a relative improvement of 14%, on average, in relation to the gains obtained when using the plain version of the CFEPC (from 6.7% to 7.6%, in absolute figures), considering the MAE error metric. Similarly, the relative improvement considering the RMSE error metric was found to be equal to 15%, on average (from 6.1% to 6.9%, in absolute figures). The experiment demonstrates that the performance gains of the CFEPC algorithm are further enhanced by approximately 50% of the performance gains achieved when the proposed algorithm is independently applied on sparse datasets (i.e., the Amazon datasets), while for the dense dataset (MovieLens Latest 100K dataset) the performance enhancement of the CFEPC algorithm is approximately equal to 30% of the gains achieved by the proposed algorithm on the same dataset.
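The relative improvements reported for the PCC case can be checked against the absolute figures. The following is a rough check that uses only the rounded values quoted in the text, so the last digit may differ from a computation on the unrounded results:

```latex
\begin{align*}
\text{MAE:}\quad  & \frac{7.8 - 6.8}{6.8} \approx 0.147 \approx 15\% \\
\text{RMSE:}\quad & \frac{6.9 - 5.8}{5.8} \approx 0.190 \approx 19\%
\end{align*}
```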

Complexity Analysis of the Proposed Algorithm
The computation of the average rating value for each item, over all users, can be easily executed offline (while loading the ratings from the rDB). In case this procedure is executed online, its complexity is O(r), where r is the number of all user ratings in the rating database. When a new rating is added to the database, the complexity of updating the average is O(1), since the new average can be directly computed on the basis of the current one and the number of ratings for the item, as shown in Equation (5). Regarding space complexity, the overhead introduced by the procedure is 1 real number per item (its average rating) and hence negligible.
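The two costs mentioned above can be illustrated with a short Python sketch: a single O(r) pass over all ratings, and an O(1) update when a rating arrives. The update uses the standard running-mean identity, new_avg = (avg · n + r) / (n + 1), which is assumed to correspond to the update the text refers to; the triple-based input layout and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def item_averages(
    ratings: Iterable[Tuple[str, str, float]]
) -> Dict[str, Tuple[float, int]]:
    """One pass over (user, item, rating) triples: O(r) time, one entry per item."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for _user, item, rating in ratings:
        sums[item] = sums.get(item, 0.0) + rating
        counts[item] = counts.get(item, 0) + 1
    # Keep the count next to each average so that later updates stay O(1).
    return {item: (sums[item] / counts[item], counts[item]) for item in sums}


def updated_average(current_avg: float, num_ratings: int, new_rating: float) -> float:
    """O(1) update of an item's average when one new rating is added."""
    return (current_avg * num_ratings + new_rating) / (num_ratings + 1)
```

Note that this sketch stores the rating count alongside each average, which adds one integer per item to the space overhead.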
The procedure of finding the number of black sheep ratings for each pair of NNs (to compute the bsf factor for this pair of users) has a complexity of O(#NNs * #commonRatings). According to [74,75], the top-K NNs are retained and the maximum number of NNs typically considered ranges from 20 to 60. The average number of common ratings for NNs pairs varies with the dataset, and the settings used to determine NNs; for instance, in [76] it is suggested that CF system implementors may opt to consider only NNs with at least 10 common ratings, to increase accuracy. In all cases, the computation of the bsf factor considers the common ratings for each pair of NNs, and therefore its complexity is identical to the computation of the similarity of the same pair of users, which is an integral step of the CF procedure; therefore, the introduction of the computation of the bsf factor does not affect the overall complexity of the algorithm. Notably, the computation of the bsf factor need only be performed between a user and his/her NNs (whose number is typically bounded by the K parameter of the top-K NN selection step), yielding significantly lower execution time than the computation of pairwise user similarities, which must be performed for all user pairs. The complexity of the rating prediction phase is not altered, as compared to the typical CF algorithm listed in Equation (3), since only one additional multiplication per considered rating is introduced.
Regarding space complexity, the overhead introduced by the need to maintain the bsf factor values is 1 number per each NN pair (the value of their black sheep factor) and can be easily accommodated in contemporary hardware.

Conclusion and Future Work
In this work, we have presented a novel CF algorithm that considers the information of the black sheep ratings between NNs in the CF rating prediction procedure, for the improvement of the rating prediction accuracy. More specifically, a black sheep rating between two NNs appears when they both like a generally unaccepted item (they both give a relatively high rating when compared to the relatively low average rating given by all database users for this item) or vice versa. The rationale behind the use of this concept is derived from the fact that if, in the real world, a person likes an item (e.g., a TV series, a car model, a videogame) that the majority of others do not (and hence this item obtains a relatively low average rating value), and this person finds another person who also likes the exact same item (which is quite a rare case), the probability that the former will value the latter's opinion more than the opinions of other users, for a future recommendation, is relatively high.
We have experimentally validated the proposed algorithm through a set of experiments, using two user similarity metrics, namely the PCC and the CS (which are the two most used user similarity metrics in CF research [77][78][79]), two rating prediction error metrics, namely the MAE and the RMSE, and six datasets of diverse product categories (videogames, music, books, movies and TV series) to ensure the reliability and generalisability of the results. Furthermore, the proposed algorithm was tested both as a standalone application and combined with another CF algorithm also aiming at enhancing rating prediction accuracy [54]. The evaluation results have shown that significant prediction accuracy gains were introduced through the inclusion of the proposed algorithm. In the first case (standalone application) an average of 2% rating prediction error reduction was found, considering all cases. In the second case (when combined with another CF algorithm) the inclusion of the proposed algorithm achieved a further average rating prediction error