On Exploiting Rating Prediction Accuracy Features in Dense Collaborative Filtering Datasets

One of the typical goals of collaborative filtering algorithms is to produce rating predictions with values very close to those that real users would give to an item. Afterwards, the items having the largest rating prediction values are recommended to the users by the recommender system. Collaborative filtering algorithms can be applied to both sparse and dense datasets, and each of these dataset categories involves different kinds of risks. As far as dense collaborative filtering datasets are concerned, where the rating prediction coverage is, most of the time, very high, we usually face large rating prediction times, issues concerning the selection of a user's near neighbours, etc. Although collaborative filtering algorithms usually achieve better results when applied to dense datasets, there is still room for improvement, since in many cases the rating prediction error is relatively high, which leads to unsuccessful recommendations and hence to recommender system unreliability. In this work, we explore rating prediction accuracy features, in a broader context, in dense collaborative filtering datasets. We conduct an extensive evaluation, using dense datasets widely used in collaborative filtering research, in order to find the associations between these features and the rating prediction accuracy.


Introduction
One of the most widely applied recommender system (RS) methods over the last 20 years is collaborative filtering (CF) [1,2]. The typical goal of a CF algorithm is to produce rating predictions for products or services that users have not already evaluated. The closer these rating predictions are to the rating values that the users themselves would give to these products or services, the higher the accuracy of the CF algorithm.
Afterwards, based on the aforementioned rating predictions, a CF RS will typically recommend, to each user, the products or services scoring the highest rating prediction values. These products carry the highest probability, among all products or services, that the user will actually like them and hence accept the recommendation (by clicking the product advertisement, buying the product or service, etc.) [3,4].
The first step of a typical CF system is to locate the 'near neighbours' (NNs) for each of its users. An NN of user u is another user v who shares similar likings with u. This can be determined by taking the stored real ratings of users u and v, set_of_ratings_u and set_of_ratings_v, finding the ones given to common products or services i (i.e., the intersection of the two sets), and comparing them. If the majority of them are (to a large extent) similar, then these users are NNs of each other [5,6]. Typically, in modern CF RSs, the aforementioned task is implemented using a user vicinity metric, such as the Pearson correlation coefficient (PCC) or the cosine similarity (CS), which quantifies the vicinity between two CF users with a numeric value [7,8].
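As an illustration, the two vicinity metrics mentioned above can be computed over the co-rated items of a user pair as follows (a minimal sketch; the function and variable names are ours, not taken from the paper):

```python
import math

def common_items(ratings_u, ratings_v):
    """Items rated by both users (the intersection of their rating sets)."""
    return set(ratings_u) & set(ratings_v)

def pcc(ratings_u, ratings_v):
    """Pearson correlation coefficient over the co-rated items."""
    common = common_items(ratings_u, ratings_v)
    if len(common) < 2:
        return 0.0  # too few co-rated items to correlate
    mu_u = sum(ratings_u[i] for i in common) / len(common)
    mu_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mu_u) * (ratings_v[i] - mu_v) for i in common)
    den = math.sqrt(sum((ratings_u[i] - mu_u) ** 2 for i in common)) * \
          math.sqrt(sum((ratings_v[i] - mu_v) ** 2 for i in common))
    return num / den if den else 0.0

def cosine(ratings_u, ratings_v):
    """Cosine similarity over the co-rated items."""
    common = common_items(ratings_u, ratings_v)
    if not common:
        return 0.0
    num = sum(ratings_u[i] * ratings_v[i] for i in common)
    den = math.sqrt(sum(ratings_u[i] ** 2 for i in common)) * \
          math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return num / den if den else 0.0
```

For instance, two users who rate their common items almost identically obtain a PCC close to 1, while users with opposed tastes obtain a negative PCC.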
The second step of a typical CF system is to compute a rating prediction value rpv_u,i of user u for item i; for this process, the CF system uses the real ratings of u's NNs (found in the previous step) for the same item. The rationale behind this setting is that, in the real world, a person usually trusts the people considered closer to him/her (higher vicinity) when asking for a recommendation, regardless of the product or service category [9,10].
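A common way to realise this step, consistent with the description above, is the mean-centred weighted average over the NNs' ratings (a sketch of the classical user-based formulation; the paper does not commit to this exact formula):

```python
def predict_rating(user_avg, nn_votes):
    """rpv_{u,i}: mean-centred weighted average of the NNs' ratings on item i.

    nn_votes: list of (sim_uv, r_vi, v_avg) tuples, one per NN v who rated i,
    where sim_uv is the vicinity of u and v, r_vi is v's rating on i, and
    v_avg is v's average rating.
    """
    den = sum(abs(sim) for sim, _, _ in nn_votes)
    if den == 0:
        return user_avg  # no usable NNs: fall back to the user's mean rating
    num = sum(sim * (r_vi - v_avg) for sim, r_vi, v_avg in nn_votes)
    return user_avg + num / den
```

Mean-centring compensates for the fact that some NNs systematically rate higher or lower than others, so only their deviation from their own mean is transferred to the prediction.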
The accuracy of CF systems is measured by the closeness of the rating prediction values to the real user rating values. Accuracy is a very active research field, where the majority of research works aim at reducing the overall deviation between the predicted values of the ratings that the users would give to the products and the values of the real user ratings. In order to evaluate the success of the aforementioned process, CF algorithms are applied to real CF rating datasets (such as the Amazon and MovieLens datasets [11,12]), usually containing records consisting of the user, the product, the rating, and possibly additional information (from the rating timestamp to information concerning the user and/or the product). Accuracy is evaluated by hiding a percentage of the ratings, then trying to predict their values, and finally assessing how close each prediction is to the real rating.
Despite a plethora of research works aiming to increase the CF rating prediction accuracy [13-15], very limited research examines the characteristics of CF users, products, or the dataset itself, which may affect the accuracy of CF rating predictions. An exception is the work of researchers who utilise the user neighbourhood [16-19]. These works have been performed, in general, for evaluating the performance of specific algorithms.
Our previous work [20] explored rating prediction accuracy features in sparse CF datasets, in a broader context, showing that a typical CF system which simply recommends the items achieving the highest rating prediction values may, in many cases, offer reduced recommendation accuracy and hence negatively affect the success of the RS.
In this paper, we adapt the aforementioned research to the context of dense CF datasets. More specifically: (a) we explore the same rating prediction accuracy features in dense CF datasets, but with different parameter values. For example, our previous work showed that, for sparse CF datasets, a rating prediction formulated using the real ratings of ≥4 NNs is an indication of a highly accurate rating prediction. For dense CF datasets, however, a user can have hundreds or even thousands of NNs. As a result, the percentage of rating predictions formulated with ≥4 NNs is almost 100%, and hence the output of the previous work cannot be applied to dense CF datasets; (b) we examine (evaluate and parameterise) the effect of one extra rating prediction accuracy feature, namely the variance of the NNs' ratings, which, in contrast to sparse CF datasets, can be reliably quantified in dense CF datasets.
To ascertain the reliability of the results produced, the present work uses (i) two widely accepted user similarity metrics, (ii) two widely accepted rating prediction error metrics, (iii) six widely accepted dense CF datasets, and (iv) three different CF algorithms, so that it can experimentally derive insight into the rating prediction accuracy features. The rest of the paper is structured as follows: Section 2 overviews the related work, while in Section 3, we present, analyse, and evaluate the rating prediction features in dense CF datasets. The obtained results are discussed in Section 4, and Section 5 concludes the paper and outlines future work. The definitions for terms, notations, and abbreviations used in this paper are tabulated in Table 1.

Uavg — the average rating value of the user for whom the prediction is formulated
Iavg — the average rating value of the item for which the prediction is formulated
UN — the number of ratings entered by the user for whom the prediction is formulated
IN — the number of ratings that have been entered for the item for which the prediction is formulated
NNsvar — the variance of the NNs' ratings given to the item for which the prediction is formulated

Related Work
The CF system accuracy research is divided into two main categories. The first category comprises algorithms which, apart from the basic information a CF system needs in order to produce rating predictions (i.e., the user-item-rating matrix), utilise supplementary elements or sources of information. These include user and item information, such as user relations in social networks (SNs) (friendship, trust, etc.), user demographic information (gender, age, nationality, etc.), item categories (e.g., taxonomy) and characteristics (colour, price, availability, etc.), user reviews on an item, etc. The second category comprises algorithms that exploit only the basic CF information, i.e., the user-item-rating matrix, and formulate specialised processing methods (e.g., clustering, the computation and exploitation of outliers, rating variability, etc.) to increase the rating prediction accuracy.
Regarding the first category, Yang et al. [21] introduced a matrix factorisation (MF) technique which improves the performance of CF recommendations by integrating sparse social trust network data with sparse user rating data, among the same users. Their model-based technique maps CF users, based on their trust relationships, into low-dimensional latent feature spaces, and aims to reflect the users' reciprocal influence on the formation of their own ratings. Yang et al. [22] presented a set of MF- and NN-based RSs and explored group affiliations and SN information for recommendations in social voting. They demonstrated that the aforementioned information can improve the accuracy of popularity-based voting recommendations. They also observed that group and social information proved to be more valuable to cold users. Hu et al. [23] presented a technique, namely MR3, which aligns the latent factors and hidden topics, in order to model item reviews and social relations with ratings, for improving the rating prediction accuracy. Furthermore, they incorporated the implicit feedback from ratings into their model, to enhance their technique. Margaris et al. [18] introduced an algorithm which combines the limited SN information of users' social relations with the limited CF information of users' ratings on items, targeting the enhancement of both the rating prediction coverage and the rating prediction accuracy in CF RSs. Their algorithm takes into account the utility and density of both the CF and the SN neighbourhoods, by formulating two partial rating predictions: the CF score and the SN score. Then, it combines these scores using a weighted average metascore algorithm with user-personalised weights. Pereira and Varma [24] presented a financial planning RS that modifies the recommendation process to enhance the recommendations. They used a hybrid approach, which combines CF techniques with demographic filtering, to overcome CF drawbacks such as data sparsity, the new-user cold-start problem, and overspecialisation. Ghasemi and Momtazi [25] introduced a technique that improves CF RSs by finding similar CF users based on both their ratings and their reviews. They utilised two lexical-based techniques, two word-neural representation techniques, and three text-neural representation techniques. Zhou et al. [26] introduced MLCF, a multi-label classification-based CF framework that enhances the recommendation quality and is based on three graphs, namely a user graph, an item graph, and a rating bipartite graph. They explored the latent correlations among items and users. They also introduced a multi-label classification rating similarity metric which captures user-class-specific relationships. Finally, they introduced the integration of two multi-label classification CF techniques, focusing on social information and ratings, into a unified rating prediction technique.
Although all the aforementioned works considerably enhance the CF rating prediction accuracy and recommendation success, the supplementary information they require (user SN information, user demographic information, user and item characteristics, etc.) may not always be available and, hence, these works cannot be applied to every CF dataset.
To this extent, Wang et al. [27] proposed the integration of the interactions between items and users. They introduced neural graph CF, a recommendation framework that propagates embeddings on the user-item graph structure, explicitly injecting the collaborative signal into the embedding process. Yu et al. [28] proposed a two-sided cross-domain CF model, which balances recommendation efficiency and accuracy. This model is based on selective ensemble learning, considering both efficiency and accuracy. Their model first combines the item-sided with the user-sided auxiliary domains, aiming to enhance the target domain performance. The cross-domain CF problem is then transformed into an ensemble learning problem, thereby transforming the selective combination problem into a selective classifier problem. Ajaegbu [29] introduced an algorithm which balances three user similarity metrics to overcome cold-start issues. This algorithm mitigates the data sparsity and cold-start issues that the three traditional algorithms face, while retaining the positive features of the existing item-based CF algorithms. Margaris et al. [30] presented an algorithm which improves the rating prediction accuracy in CF without the need for any kind of supplementary information. They achieved accuracy improvement by enhancing the weight of the black sheep NNs' opinions. More specifically, they adjusted the NN weights based on the degree to which the NN and the target user deviate from each of the dominant ratings of each item. Zarzour et al. [31] introduced a new CF algorithm based on clustering techniques and dimensionality reduction. The proposed algorithm uses singular value decomposition to reduce the dimensionality, while it uses the K-means algorithm in order to cluster similar users. They also proposed and assessed a two-stage RS which produces efficient and accurate recommendations. Neysiani et al. [32] presented a method that produces association rules, based on genetic algorithms, which identifies these association rules in an unsupervised manner, while at the same time being efficient for space search. With this method, users do not need to specify support thresholds. Additionally, in contrast to traditional mining models, it does not need to discover a large number of rules. Chen et al. [33] presented a CF recommendation technique based on evolutionary clustering and user correlation. The authors pre-processed the rating matrix with dimension reduction and normalisation to obtain denser rating data. They used dynamic evolutionary clustering and highest-similar-interest NN research. Finally, they proposed a user relationship metric that exploits potential information and user satisfaction.
Still, none of the aforementioned research works examines the features related to the rating prediction accuracy in CF datasets. An RS that typically recommends the items computed to have the highest rating prediction scores, without taking other features into account, may achieve reduced recommendation accuracy and overall success and, hence, cause trust issues among its users.
Recently, Margaris et al. [20] indicated that it may be better for an RS to recommend an item i2 computed to have a lower prediction value than item i1, if the rating prediction concerning i2 is found to be more reliable than the respective one for i1, by exploring, in a broader context, the rating prediction accuracy features in sparse CF datasets. They examined five rating prediction features, using sparse CF datasets, and found that three of them (the number of NNs participating in the rating prediction formulation, the mean rating value of the active user, and the mean rating value of the item) can indicate, in the majority of cases, a reliable rating prediction.
The present work advances the state-of-the-art research regarding the rating prediction accuracy features in CF by (1) applying and parameterising the aforementioned rating prediction accuracy features in dense CF datasets, (2) applying and parameterising an additional rating prediction accuracy feature that can be reliably applied only in dense CF datasets, and (3) evaluating the rating prediction accuracy features using widely used and accepted dense CF datasets, error metrics, and user similarity metrics. Since the features tested in this work are derived from the original rating matrix (user-item-rating tuples) and do not need any kind of additional information, they prove useful to any CF algorithm applied to dense CF datasets.

Exploring Rating Prediction Features
In this section, the six rating prediction features of the experimental part of our work are presented, analysed, and evaluated. More specifically, we present a thorough investigation of the rating prediction features, examining their associations with improvement in the rating prediction accuracy in dense CF datasets.
The six rating prediction features, along with their cases tested in the experimental procedure, are the following:

• The percentage of the active user U's near neighbours (NNs) taking part in the rating prediction (NN%): for this feature, we examined values from 0% to 24%, with the increment step set to 3%;
• The active user U's average rating value (Uavg): for this feature, we examined the range from the minimum allowed rating value to the maximum allowed rating value in the dataset, with the increment step set to 0.5;
• The average rating value of the item for which the prediction is computed (Iavg): for this feature, we examined the range from the minimum allowed rating value to the maximum allowed rating value in the dataset, with the increment step set to 0.5;
• The number of items that the active user U has rated (UN): for this feature, we examined values from 100 to 500, and an extra case of >500, with the increment step set to 100;
• The number of users that have rated the item i for which the prediction is computed (IN): for this feature, we examined values from 100 to 500, and an extra case of >500, with the increment step set to 100;
• The variance of the NNs' ratings given to the item for which the prediction is computed (NNsvar): for this feature, we examined values from 0.0 to 2.5, and an extra case of >2.5, with the increment step set to 0.25.
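All six features above can be derived from the rating matrix alone; the following hypothetical helper sketches their computation for a single rating prediction (every name here is ours, not the paper's):

```python
from statistics import mean, pvariance

def prediction_features(user_ratings, item_ratings, nn_ratings_on_item,
                        n_nns_used, n_nns_total):
    """Compute the six examined features for one rating prediction.

    user_ratings: all ratings entered by the active user U
    item_ratings: all ratings entered for the item i
    nn_ratings_on_item: the NNs' ratings on item i used in the prediction
    n_nns_used / n_nns_total: NNs participating vs. all NNs of U
    """
    return {
        "NN%":    100.0 * n_nns_used / n_nns_total if n_nns_total else 0.0,
        "Uavg":   mean(user_ratings),
        "Iavg":   mean(item_ratings),
        "UN":     len(user_ratings),
        "IN":     len(item_ratings),
        "NNsvar": pvariance(nn_ratings_on_item)
                  if len(nn_ratings_on_item) > 1 else 0.0,
    }
```

Since the helper touches nothing beyond user-item-rating tuples, it can be attached to any CF pipeline, which is the property the paper emphasises for these features.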
To ascertain that our work is NN-independent, measurements were obtained using all the NNs each active user has, setting the NN vicinity threshold to 0.0 [20,34,35].
In our experiments, we used CF datasets that are widely accepted and used in CF research [36-38]. All datasets are relatively dense (their densities range from 0.13% to 5.88%); Table 2 presents their essential information [12,39-41]. While there is no agreed threshold for classifying a dataset as "dense" or "sparse", the density of all the datasets examined in this work is at least 60% higher than the density of the datasets examined in our previous work [20], which are characterised as "sparse". The density of a dataset d is defined as the ratio of the number of elements of the user-item rating matrix that have non-null values to the total number of elements of the user-item rating matrix; that is, density = #ratings / (#users * #items). To ascertain that a single rating value range was used, so that the results of the different datasets are comparable, the ratings in each dataset were normalised to the range [1.0-5.0], using the standard min-max formula [42], which is used in many CF research works [43-45]. In order to quantify the rating prediction accuracy, the following two CF rating prediction error metrics were used [46,47]: 1. the mean absolute error (MAE) metric, which handles all errors uniformly; 2. the root-mean-squared error (RMSE) metric, which boosts the significance of large deviations between the real user rating and the rating prediction produced by the CF system.
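The min-max normalisation and the two error metrics described above can be sketched as follows (straightforward textbook formulas; the helper names are ours):

```python
import math

def min_max(r, old_min, old_max, new_min=1.0, new_max=5.0):
    """Standard min-max rescaling of a rating into [new_min, new_max]."""
    return new_min + (r - old_min) * (new_max - new_min) / (old_max - old_min)

def mae(real, predicted):
    """Mean absolute error: every deviation weighs the same."""
    return sum(abs(r - p) for r, p in zip(real, predicted)) / len(real)

def rmse(real, predicted):
    """Root-mean-squared error: large deviations are amplified by squaring."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(real, predicted)) / len(real))
```

For example, a rating of 10 on a 0-10 scale maps to 5.0 on the common [1.0-5.0] scale, so results from datasets with different native scales become directly comparable.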
To compute the deviation between the rating prediction and the real rating value, the typical "hide-one" technique [48,49] was used for all the ratings in each dataset (where we sought to predict the value of all the ratings, one at a time, in the dataset). In more detail, each time, one rating of the dataset is hidden and its value is predicted using the non-hidden ratings. The "hide-one" or "leave-one-out cross-validation" approach is widely used in CF works [50-52], and it has the advantage of producing model estimates with less bias and more ease [53]. Its main disadvantage is that it cannot be applied online in very large datasets (due to the number of computational steps); however, in our work, it was performed offline.
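A minimal sketch of the hide-one loop described above (the `predict_fn` callback and the list-of-tuples data layout are our assumptions, not the authors' implementation):

```python
def hide_one_errors(ratings, predict_fn):
    """Leave-one-out: hide each (user, item, rating) tuple in turn, predict its
    value from the remaining ratings, and collect the absolute errors."""
    errors = []
    for idx, (user, item, real) in enumerate(ratings):
        visible = ratings[:idx] + ratings[idx + 1:]  # everything but the hidden rating
        pred = predict_fn(user, item, visible)
        errors.append(abs(real - pred))
    return errors
```

The collected absolute errors feed directly into the MAE and RMSE computations; as the paper notes, the cost is one prediction per rating, which is why the procedure was run offline.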
To ascertain that our work is algorithm-independent, we obtained measurements using three different CF algorithms:

• A "plain" CF algorithm [54,55];
• A sequential CF algorithm [56];
• A CF algorithm which exploits common rating histories up to the review time of the item for which the prediction is being formulated [57].
A close agreement between the results from all the experiments was observed (less than 4% difference in all the cases); hence, for conciseness, we report only the results produced by the plain CF algorithm.

The NNs' Percentage Taking Part in the Rating Prediction
Figure 1 illustrates the MAE observed for the six aforementioned dense datasets, considering the NN% feature and using the PCC user similarity metric. For all the datasets, when the percentage of the NNs taking part in the rating prediction increased, an MAE drop was observed, until this percentage reached the value of 15%. After that value, a different behaviour between the datasets was observed; the MAE change became non-monotonic. However, in all cases, the MAE for values of NN% exceeding 15% was less than the MAE at NN% = 15%. The average MAE and RMSE reductions from the case of NN% ≈ 0% to the case of NN% = 15% were measured to be equal to 13% and 12%, respectively. When the CS user similarity metric was used, the exact same phenomenon was observed. More specifically, the average MAE and RMSE reductions from the case of NN% ≈ 0% to the case of NN% = 15% were measured to be equal to 10% and 9%, respectively.
Overall, we can conclude that a correlation exists between the NN% feature and the rating prediction accuracy in dense CF datasets.

Uavg Feature
When the active user's mean rating value was in the middle of the rating range (2 ≤ Uavg ≤ 3), the accuracy was low, a fact that is reflected in the value of the MAE. This is in contrast to the case when the active user's mean rating value was close to the rating range's boundaries, and especially to the higher one (Uavg ≥ 4.5). In the latter case, the average MAE and RMSE observed were lower by 61% and 50%, respectively, when compared with the case in which the active user's mean rating value was in the middle of the rating range (2 ≤ Uavg ≤ 3). When the CS user similarity metric was used, the exact same phenomenon was observed. More specifically, the average MAE and RMSE observed for the cases in which the user's mean rating was close to the upper rating scale boundary were 53% and 46% smaller, respectively, when compared with the case in which the active user's mean rating value was in the middle of the rating range (2 ≤ Uavg ≤ 3). Notably, this behaviour pattern was observed in all the datasets, regardless of their density, since the dataset density has no impact on the Uavg quantity.

Overall, we can again conclude that a correlation exists between the Uavg feature and the rating prediction accuracy in dense CF datasets.

Iavg Feature
Figure 3 illustrates the MAE observed on the datasets, considering the Iavg feature and using the PCC user similarity metric.
When the active item's mean rating value was towards the low end of the rating range but not very close to it (1.5 ≤ Iavg ≤ 2.5), the prediction accuracy was low. Conversely, when the mean rating value of the active item was close to the boundaries of the rating range, the prediction accuracy was high. This was especially evident for values close to the upper boundary (Iavg ≥ 4.5). In the latter case, the average MAE and RMSE observed were 42% and 37% smaller, respectively, when compared with those of the case in which the active item's mean rating value was in the "low-accuracy" area (1.5 ≤ Iavg ≤ 2.5). When the CS user similarity metric was used, the exact same phenomenon was observed. More specifically, the average MAE and RMSE observed were 40% and 37% reduced, respectively, when compared with those of the case in which the mean rating value of the active item was in the "low-accuracy" area (1.5 ≤ Iavg ≤ 2.5). Overall, we can again conclude that a correlation exists between the Iavg feature and the rating prediction accuracy in dense CF datasets.

UN Feature
A divergent behaviour was observed among the six datasets when the UN feature's value increased, and, in general, no clear minima or maxima could be identified when the value of UN varied. Hence, no correlation can be established between the UN feature and the rating prediction accuracy in dense CF datasets.

IN Feature
A divergent behaviour was again observed among the six datasets when the value of the IN feature increased, and, in general, no clear minima or maxima could be identified when the value of IN varied. Hence, no correlation can be established between the IN feature and the rating prediction accuracy in dense CF datasets.

NNsvar Feature
When the variance of the NNs' ratings on the item for which the prediction was being formulated was relatively low, especially in the range of 0.25-0.75, a high level of prediction accuracy was observed. More specifically, the average MAE reduction from the cases where NNsvar > 2.5 to the cases where 0.25 ≤ NNsvar ≤ 0.75 equalled 43%, while the respective average RMSE reduction equalled 40%. When the CS user similarity metric was used, the exact same phenomenon was observed. More specifically, the average MAE and RMSE reductions from NNsvar > 2.5 to 0.25 ≤ NNsvar ≤ 0.75 were measured to be equal to 33% and 27%, respectively.

Overall, we can conclude that a correlation exists between the NNsvar feature and the rating prediction accuracy in dense CF datasets.

Discussion of the Results
From the experimental evaluation presented in the previous section, we can conclude that, in dense CF datasets, a CF rating prediction was found to be more reliable in the following cases:
1. The percentage of the active user's NNs taking part in the rating prediction was ≥15%: when taking into account ≥15% of a user's NNs, the CF rating prediction was considered more sound, due to the fact that, as in real life, a recommendation based on many opinions bears a high success probability;
2. The active user's mean rating value was close to the limits of the rating range: it is much easier for a rating prediction system to predict the next rating of a user who almost always enters either excellent or bad ratings;
3. The predicted item's mean rating value was close to the limits of the rating range: it is much easier for a rating prediction system to predict the next rating for an item that is practically considered either widely acceptable or unacceptable;
4. The variance of the user's NNs' ratings on the predicted item was relatively low (in the 0.25-0.75 range): it is easier for a rating prediction system to predict a rating for a user whose close people share similar opinions (either good or bad) on an item.
Using the above findings, the accuracy of an RS can be significantly improved, since the RS may opt not to recommend an item with a high prediction score that is, however, deemed of low reliability, and instead include in the recommendation an alternative item which may have a slightly lower prediction but is associated with high confidence.
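As a sketch of how an RS could exploit these findings, the following hypothetical reliability check combines the four conditions above and re-ranks candidate items so that reliable predictions are preferred. The NN% and NNsvar thresholds come from the reported experiments, but the exact margin for a mean rating being "close to the limits" of the rating range is our assumption:

```python
def is_reliable(feats, rating_min=1.0, rating_max=5.0):
    """Heuristic reliability check based on the four indicative features.
    The 0.5 margin on Uavg/Iavg is an assumed reading of "close to the limits"."""
    return (feats["NN%"] >= 15.0
            and (feats["Uavg"] <= rating_min + 0.5 or feats["Uavg"] >= rating_max - 0.5)
            and (feats["Iavg"] <= rating_min + 0.5 or feats["Iavg"] >= rating_max - 0.5)
            and 0.25 <= feats["NNsvar"] <= 0.75)

def rerank(candidates):
    """Prefer reliable predictions; among equally reliable ones, keep the
    higher prediction value. candidates: list of (item, rpv, feats) tuples."""
    return sorted(candidates,
                  key=lambda c: (is_reliable(c[2]), c[1]),
                  reverse=True)
```

Under this policy, an item with rpv = 4.5 backed by a reliable prediction would be recommended ahead of an item with rpv = 4.8 whose prediction fails the reliability check, which is exactly the trade-off described above.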
It is worth noting that the FilmTrust dataset exhibited different behaviour than the other datasets used in the experiments, in particular regarding the Iavg and UN features (Figures 3 and 4). The diverging behaviour observed in Figure 3 is due to the predictions related to the items with low Iavg values, the variance of which was found to be very high in this dataset. Notably, the FilmTrust dataset had the lowest rating average among the datasets (3.00, against 3.53 in the MovieLens datasets and 3.74 in the Dianping SocialRec 2015 dataset). The deviating behaviour in Figure 4 is attributed to the fact that, in the FilmTrust dataset, most users have rated few items; consequently, the FilmTrust data points associated with high UN values (the ranges of 301-400 and 401-500) practically represent outliers (fewer than 20 users per range), while the data point corresponding to the range ">500" is missing because no user has more than 500 ratings in this dataset. A more in-depth analysis of the effect of the statistical distribution of dataset features and skews on the behaviour of the dataset will be considered in our future work.

Conclusions and Future Work
In this work, we explored six rating prediction accuracy features in dense CF datasets, with the aim of determining whether they directly affect the rating prediction accuracy. To ascertain the reliability of the results produced, two widely accepted user similarity metrics, two widely accepted rating prediction error metrics, six widely accepted dense CF datasets, and three different CF algorithms were used to experimentally provide insight into the rating prediction accuracy features.
The evaluation results showed that (a) the percentage of the active user's NNs taking part in the rating prediction, (b) the active user's mean rating value, (c) the predicted item's mean rating value, and (d) the variance of the active user's NNs' ratings on the predicted item were correlated with the reduction in the rating prediction error.
In our future work, we plan to refine the CF rating prediction algorithm by quantifying the reliability of a CF rating prediction, based on the four prediction features that were found to affect the rating prediction accuracy in this work. Moreover, we will focus on exploring additional rating prediction features in dense CF datasets.

Figure 1. Effect of the NN% feature on the rating prediction MAE, when the PCC user similarity metric was used.

Figure 2 illustrates the MAE observed on the datasets, considering the Uavg feature and using the PCC user similarity metric.

Figure 2. Effect of the Uavg feature on the rating prediction MAE, when the PCC user similarity metric was used.

Figure 3. Effect of the Iavg feature on the rating prediction MAE, when the PCC user similarity metric was used.

Figure 4 illustrates the MAE observed on the datasets, considering the UN feature and using the PCC user similarity metric.

Figure 4. Effect of the UN feature on the rating prediction MAE, when the PCC user similarity metric was used.

Figure 5 illustrates the MAE observed on the datasets, considering the IN feature and using the PCC user similarity metric.

Figure 5. Effect of the IN feature on the rating prediction MAE, when the PCC user similarity metric was used.

Figure 6 illustrates the MAE observed on the datasets, considering the NNsvar feature and using the PCC user similarity metric.

Figure 6. Effect of the NNsvar feature on the rating prediction MAE, when the PCC user similarity metric was used.

Table 1. Definitions for terms, notations, and abbreviations used.