# Augmenting Black Sheep Neighbour Importance for Enhancing Rating Prediction Accuracy in Collaborative Filtering

## Abstract


## 1. Introduction

## 2. Related Work

## 3. The Proposed Algorithm

- Find users whose tastes are close to those of U, by examining the similarity of the ratings already submitted to the rDB, in order to identify U’s near neighbour (NN) users; these users will operate as recommenders to U. Typically, in CF systems, the metrics used to quantify user similarity are the Pearson correlation coefficient (PCC) and the Cosine Similarity (CS) [55,56], which are expressed as shown in Equations (1) and (2), respectively: $$sim\_PCC\left(U,V\right)=\frac{\sum_{k}\left(r_{U,k}-\overline{r_{U}}\right)\cdot\left(r_{V,k}-\overline{r_{V}}\right)}{\sqrt{\sum_{k}\left(r_{U,k}-\overline{r_{U}}\right)^{2}\cdot\sum_{k}\left(r_{V,k}-\overline{r_{V}}\right)^{2}}},$$ $$sim\_CS\left(U,V\right)=\frac{\sum_{k}r_{U,k}\cdot r_{V,k}}{\sqrt{\sum_{k}\left(r_{U,k}\right)^{2}}\cdot\sqrt{\sum_{k}\left(r_{V,k}\right)^{2}}}.$$

- Predict the rating value that U would give to an item i; in order to compute the rating prediction $p_{U,i}$, the standard CF rating prediction formula [26,57] is typically applied: $$p_{U,i}=\overline{r_{U}}+\frac{\sum_{V\in NN_{U}}sim\left(U,V\right)\cdot\left(r_{V,i}-\overline{r_{V}}\right)}{\sum_{V\in NN_{U}}sim\left(U,V\right)}.$$

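The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes that ratings are stored as nested dictionaries (`user -> item -> rating`), that the sums over k run over the items rated by both users, that each mean is taken over all ratings of the respective user, and that the neighbours passed to `predict` have positive similarity. All function and variable names are hypothetical.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def sim_pcc(ratings_u, ratings_v):
    """Pearson correlation coefficient, following Equation (1).

    Assumption: k ranges over the items rated by both users, and each
    mean is computed over all ratings of the respective user.
    """
    common = ratings_u.keys() & ratings_v.keys()
    mean_u, mean_v = mean(ratings_u.values()), mean(ratings_v.values())
    num = sum((ratings_u[k] - mean_u) * (ratings_v[k] - mean_v) for k in common)
    den = sqrt(sum((ratings_u[k] - mean_u) ** 2 for k in common)
               * sum((ratings_v[k] - mean_v) ** 2 for k in common))
    return num / den if den else 0.0


def sim_cs(ratings_u, ratings_v):
    """Cosine similarity over the co-rated items, following Equation (2)."""
    common = ratings_u.keys() & ratings_v.keys()
    num = sum(ratings_u[k] * ratings_v[k] for k in common)
    den = (sqrt(sum(ratings_u[k] ** 2 for k in common))
           * sqrt(sum(ratings_v[k] ** 2 for k in common)))
    return num / den if den else 0.0


def predict(ratings, target, item, neighbours, sim):
    """Standard CF rating prediction for `target` on `item`.

    Neighbours that have not rated `item` are skipped; if no neighbour
    contributes, the target user's mean rating is returned.
    """
    ratings_u = ratings[target]
    num = den = 0.0
    for v in neighbours:
        ratings_v = ratings[v]
        if item not in ratings_v:
            continue
        s = sim(ratings_u, ratings_v)
        num += s * (ratings_v[item] - mean(ratings_v.values()))
        den += s
    base = mean(ratings_u.values())
    return base + num / den if den else base


# Toy data, invented purely for illustration.
ratings = {
    "U": {"a": 5, "b": 3, "c": 4},
    "V": {"a": 4, "b": 2, "c": 3, "d": 5},
}
prediction = predict(ratings, "U", "d", ["V"], sim_pcc)
```

Note that the raw formula does not clamp the result to the rating scale; whether and how a system clamps predictions is a design choice outside the formula itself.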

## 4. Algorithm Tuning and Experimental Evaluation

- Determine the optimal value of the bsf factor, in order to tune the proposed algorithm; and
- Evaluate the rating prediction accuracy of the proposed algorithm, both when used independently and when combined with a state-of-the-art CF algorithm that also aims at improving rating prediction accuracy.

#### 4.1. Determining the Algorithm Parameters

_{i}corresponds to a different setting for the computation of the bsf factor as follows:

- Setting 1: $bsf\left(U,V\right)=\begin{cases}1.2, & \text{if } \left(blackSheepRatings\left(U,V\right)\ge 1\right)\wedge \left(low\_thr=2.5\right)\wedge \left(high\_thr=3.5\right)\\ 1, & \text{otherwise}\end{cases}$
- Setting 2: $bsf\left(U,V\right)=\begin{cases}1.2, & \text{if } \left(blackSheepRatings\left(U,V\right)\ge 1\right)\wedge \left(low\_thr=1.5\right)\wedge \left(high\_thr=4.5\right)\\ 1, & \text{otherwise}\end{cases}$
- Setting 3: $bsf\left(U,V\right)=\begin{cases}1.2, & \text{if } \left(blackSheepRatings\left(U,V\right)\ge 1\right)\wedge \left(low\_thr=2.0\right)\wedge \left(high\_thr=4.0\right)\\ 0.9, & \text{otherwise}\end{cases}$
- Setting 4: $bsf\left(U,V\right)=\begin{cases}1.2, & \text{if } \left(blackSheepRatings\left(U,V\right)\ge 1\right)\wedge \left(low\_thr=2.5\right)\wedge \left(high\_thr=3.5\right)\\ 0.9, & \text{otherwise}\end{cases}$
- Setting 5: $bsf\left(U,V\right)=\begin{cases}1.2, & \text{if } \left(blackSheepRatings\left(U,V\right)\ge 1\right)\wedge \left(low\_thr=2.0\right)\wedge \left(high\_thr=4.0\right)\\ 0.8, & \text{otherwise}\end{cases}$
- Setting 6: $bsf\left(U,V\right)=\begin{cases}1.2, & \text{if } \left(blackSheepRatings\left(U,V\right)\ge 5\%\cdot numCommonlyRated\left(U,V\right)\right)\wedge \left(low\_thr=2.5\right)\wedge \left(high\_thr=3.5\right)\\ 0.9, & \text{otherwise}\end{cases}$
- Setting 7: $bsf\left(U,V\right)=\begin{cases}1.2, & \text{if } \left(blackSheepRatings\left(U,V\right)\ge 20\%\cdot numCommonlyRated\left(U,V\right)\right)\wedge \left(low\_thr=2.0\right)\wedge \left(high\_thr=4.0\right)\\ 0.8, & \text{otherwise}\end{cases}$
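The seven settings can be expressed compactly as a lookup table plus one function. The sketch below is illustrative only: it reads the low_thr and high_thr terms as parameter choices for the black sheep rating count rather than as run-time conditions, and the class, field, and function names are hypothetical. How the resulting bsf value enters the rating prediction is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BsfSetting:
    low_thr: float          # used when counting black sheep ratings
    high_thr: float         # used when counting black sheep ratings
    boost: float            # bsf value when the condition holds
    otherwise: float        # bsf value in every other case
    min_count: int = 1      # absolute threshold (Settings 1 to 5)
    min_share: float = 0.0  # share of commonly rated items (Settings 6, 7)


SETTINGS = {
    1: BsfSetting(2.5, 3.5, 1.2, 1.0),
    2: BsfSetting(1.5, 4.5, 1.2, 1.0),
    3: BsfSetting(2.0, 4.0, 1.2, 0.9),
    4: BsfSetting(2.5, 3.5, 1.2, 0.9),
    5: BsfSetting(2.0, 4.0, 1.2, 0.8),
    6: BsfSetting(2.5, 3.5, 1.2, 0.9, min_count=0, min_share=0.05),
    7: BsfSetting(2.0, 4.0, 1.2, 0.8, min_count=0, min_share=0.20),
}


def bsf(setting, black_sheep_ratings, num_commonly_rated):
    """Return the bsf factor for a user pair under the given setting.

    `black_sheep_ratings` must already have been counted with the
    setting's low_thr and high_thr values.
    """
    required = max(setting.min_count, setting.min_share * num_commonly_rated)
    return setting.boost if black_sheep_ratings >= required else setting.otherwise
```

For example, a pair with one black sheep rating among ten commonly rated items satisfies the 5% condition of Setting 6 but not the 20% condition of Setting 7.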

- low_thr denotes the value below which a rating is considered to be negative; formally, $is\_negative\left(r_{U,i}\right)\iff r_{U,i}\le low\_thr$
- high_thr, correspondingly, represents the value above which a rating is considered to be positive; formally, $is\_positive\left(r_{U,i}\right)\iff r_{U,i}\ge high\_thr$
- blackSheepRatings(U,V) is the number of ratings where users U and V both have a positive (or negative) rating, while the user community has a negative (or positive, respectively) rating on the same item. Formally:
  - $is\_communityPositive\left(i\right)\iff \underset{W\in UC}{\mathrm{average}}\left(r_{W,i}\right)\ge high\_thr$, where UC is the user community, i.e., the set of users in the dataset
  - $is\_communityNegative\left(i\right)\iff \underset{W\in UC}{\mathrm{average}}\left(r_{W,i}\right)\le low\_thr$
  - $is\_BlackSheepRating\left(U,V,i\right)\iff \left(is\_positive\left(r_{U,i}\right)\wedge is\_positive\left(r_{V,i}\right)\wedge is\_communityNegative\left(i\right)\right)\vee \left(is\_negative\left(r_{U,i}\right)\wedge is\_negative\left(r_{V,i}\right)\wedge is\_communityPositive\left(i\right)\right)$
  - $blackSheepRatings\left(U,V\right)=\left|\left\{i\in I:is\_BlackSheepRating\left(U,V,i\right)\right\}\right|$

- numCommonlyRated(U,V) is the number of items that have been rated by both U and V; formally, $numCommonlyRated\left(U,V\right)=\left|\left\{i\in I:r_{U,i}\ne NULL\wedge r_{V,i}\ne NULL\right\}\right|$
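The definitions above translate fairly directly into code. The following is a minimal sketch under stated assumptions: ratings are held in a nested dictionary (`user -> item -> rating`), the community average of an item is taken over every user who rated it (including U and V themselves), and all names and the toy data are hypothetical.

```python
def community_average(ratings, item):
    """Average rating of `item` over all users in the dataset who rated it."""
    values = [user_ratings[item] for user_ratings in ratings.values()
              if item in user_ratings]
    return sum(values) / len(values)


def is_black_sheep_rating(ratings, u, v, item, low_thr, high_thr):
    """True if U and V agree on `item` while the community holds the opposite view."""
    r_u, r_v = ratings[u].get(item), ratings[v].get(item)
    if r_u is None or r_v is None:
        return False
    avg = community_average(ratings, item)
    both_positive = r_u >= high_thr and r_v >= high_thr
    both_negative = r_u <= low_thr and r_v <= low_thr
    return (both_positive and avg <= low_thr) or (both_negative and avg >= high_thr)


def num_commonly_rated(ratings, u, v):
    """Number of items rated by both U and V."""
    return len(ratings[u].keys() & ratings[v].keys())


def black_sheep_ratings(ratings, u, v, low_thr, high_thr):
    """Number of items on which U and V jointly deviate from the community."""
    common = ratings[u].keys() & ratings[v].keys()
    return sum(is_black_sheep_rating(ratings, u, v, i, low_thr, high_thr)
               for i in common)


# Toy data, invented for illustration: U and V like item "x" and dislike
# item "y", while eight other users hold the opposite opinion on both.
ratings = {
    "U": {"x": 5, "y": 1, "z": 4},
    "V": {"x": 5, "y": 1, "z": 2},
}
ratings.update({f"W{n}": {"x": 1, "y": 5} for n in range(8)})

count = black_sheep_ratings(ratings, "U", "V", low_thr=2.0, high_thr=4.0)
common = num_commonly_rated(ratings, "U", "V")
```

In this toy example the community averages for "x" and "y" are 1.8 and 4.2, so both count as black sheep ratings, whereas "z" does not because U and V disagree on it.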

#### 4.2. Rating Prediction Accuracy Improvement Achieved by the Proposed Algorithm

#### 4.3. Combining the Proposed Algorithm with a Second Algorithm Targeting Rating Prediction Accuracy Improvement

$CF_{EPC}$ algorithm [54]. The $CF_{EPC}$ algorithm is a state-of-the-art algorithm (published towards the end of 2020) that also targets improving the CF rating prediction accuracy and does not need any additional information on the items or the users (e.g., user social relationships or item categories). Hence, it can also be applied to all CF datasets. Figure 4 illustrates the improvement in the MAE achieved by combining the presented algorithm with the $CF_{EPC}$ algorithm, when using the PCC as the similarity metric and again taking the performance of the plain CF algorithm as a yardstick.

Combining the $CF_{EPC}$ algorithm with the proposed algorithm resulted in a relative improvement of 15%, on average, over the gains obtained when using the plain version of $CF_{EPC}$ (from 6.8% to 7.8%, in absolute figures), considering the MAE error metric. Similarly, the relative improvement considering the RMSE error metric has been found to be equal to 19%, on average (from 5.8% to 6.9%, in absolute figures). The experiment demonstrates that, on sparse datasets (i.e., the Amazon datasets), the performance gains of the $CF_{EPC}$ algorithm are further enhanced by approximately 50% of the performance gains achieved when the proposed algorithm is applied independently, while for the dense dataset (MovieLens Latest 100K dataset) the performance enhancement of the $CF_{EPC}$ algorithm is approximately equal to 25% of the gains achieved by the proposed algorithm on the same dataset.
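As a quick check of how the relative and absolute figures for the PCC case relate, and assuming that the relative improvement is computed from the rounded average gains quoted above, the arithmetic is:

$$\frac{7.8-6.8}{6.8}\approx 0.147\approx 15\%\ \text{(MAE)},\qquad \frac{6.9-5.8}{5.8}\approx 0.190\approx 19\%\ \text{(RMSE)}.$$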

$CF_{EPC}$ algorithm, when using the CS as the similarity metric and again taking the performance of the plain CF algorithm as a yardstick.

Combining the $CF_{EPC}$ algorithm with the presented algorithm resulted in a relative improvement of 14%, on average, over the gains obtained when using the plain version of $CF_{EPC}$ (from 6.7% to 7.6%, in absolute figures), considering the MAE error metric. Similarly, the relative improvement considering the RMSE error metric has been found to be equal to 15%, on average (from 6.1% to 6.9%, in absolute figures). The experiment demonstrates that, on sparse datasets (i.e., the Amazon datasets), the performance gains of the $CF_{EPC}$ algorithm are further enhanced by approximately 50% of the performance gains achieved when the proposed algorithm is applied independently, while for the dense dataset (MovieLens Latest 100K dataset) the performance enhancement of the $CF_{EPC}$ algorithm is approximately equal to 30% of the gains achieved by the proposed algorithm on the same dataset.
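For reference, the two error metrics used throughout this section, and the percentage reduction against a baseline, can be sketched as follows. The definitions of MAE and RMSE are the standard ones; reading the reported gains as the percentage reduction of the error relative to the plain CF algorithm is an assumption, and the function names are hypothetical.

```python
from math import sqrt


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalises large deviations more than MAE."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def reduction_pct(baseline_error, new_error):
    """Percentage by which `new_error` is lower than `baseline_error`."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

For instance, with hypothetical errors of 0.80 for the plain CF algorithm and 0.75 for an improved variant, `reduction_pct(0.80, 0.75)` gives a reduction of about 6.25%.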

#### 4.4. Complexity Analysis of the Proposed Algorithm

## 5. Conclusion and Future Work

MAE and RMSE reduction achieved by the proposed algorithm, when using the PCC user similarity metric.

MAE and RMSE reduction achieved by the proposed algorithm, when using the CS user similarity metric.

MAE reduction achieved by the inclusion of the proposed algorithm in the $CF_{EPC}$ algorithm, when using the PCC user similarity metric.

MAE reduction achieved by the inclusion of the proposed algorithm in the $CF_{EPC}$ algorithm, when using the CS user similarity metric.

Dataset Name | #Users | #Items | #Ratings | Density |
---|---|---|---|---|
Amazon “Videogames” | 24 K | 11 K | 232 K | 0.09% |
Amazon “CDs and Vinyl” | 75 K | 64 K | 1.1 M | 0.02% |
Amazon “Movies and TV” | 124 K | 50 K | 1.7 M | 0.03% |
Amazon “Books” | 604 K | 368 K | 8.9 M | 0.004% |
MovieLens “Latest 100K—Recommended for education and development” | 670 | 9 K | 100 K | 1.7% |
NetFlix competition | 480 K | 18 K | 96 M | 1.1% |
