When Diversity Met Accuracy: A Story of Recommender Systems †

Abstract: Diversity and accuracy are frequently considered as two irreconcilable goals in the field of Recommender Systems. In this paper, we study different approaches to recommendation, based on collaborative filtering, which intend to improve both sides of this trade-off. We performed a battery of experiments measuring precision, diversity and novelty on different algorithms. We show that some of these approaches are able to improve the results in all the metrics with respect to classical collaborative filtering algorithms, proving to be both more accurate and more diverse. Moreover, we show how some of these techniques can be tuned easily to favour one side of this trade-off over the other, based on user desires or business objectives, by simply adjusting some of their parameters.


Introduction
Over the years, the user experience with different services has shifted from a proactive approach, where the user actively looks for content, to one where the user is more passive and content is suggested to her by the service. This has been possible thanks to advances in the field of recommender systems (RS), which allow services to make better suggestions, personalized to each user's preferences.
Most of the research in the field focuses on accuracy as the main objective of these systems. For example, the goal of the Netflix Prize was to improve the accuracy of Cinematch (Netflix's recommendation system) by 10%, measured by the root mean squared error of the predictions. This competition fuelled research in the area and several advances came from it. However, in the wake of the results, studies have shown the inadequacy of this measure for the top-n recommendation task [1], and have introduced IR metrics, such as precision or the normalized discounted cumulative gain (nDCG), to assess the performance of the system. To apply these measures, non-rated items are considered non-relevant. It has been acknowledged that this assumption may underestimate the true metric value; however, it provides a better estimation of recommender quality [2].
Other studies have also pointed out the convenience of measuring other properties of recommender systems, such as diversity or novelty [3,4]. A system that is able to produce novel recommendations increases the probability of suggesting items that the user would not have discovered by herself; this property is called serendipity. This quality is often associated with user satisfaction [5], but it is difficult to measure and usually requires online experiments. We use novelty as a proxy for this property. Being able to produce diverse recommendations, which make use of the full catalogue of items instead of focusing on the most popular ones, is usually an added benefit of a recommender system. Diversity is highly appreciated by vendors [6,7].
We analysed the performance of two memory-based recommender systems, each using four different clustering techniques to compute the neighbourhoods. This performance was evaluated in terms of precision, diversity and novelty metrics. We also analysed how the systems perform with different values of their parameters, in order to show how the trade-off between accuracy and diversity/novelty can be tuned to suit the needs of the user or the business objectives.

Materials and Methods
We conducted a series of experiments in order to analyse the trade-off between accuracy, diversity and novelty in Recommender Systems.

Algorithms
We chose two memory-based algorithms and analysed their performance. The first one, Weighted Sum Recommender (WSR), is a formulation of the classic user-based recommender that stands out for its simplicity and performance [8]. The second one is an adaptation of Relevance-based Language Models (frequently abbreviated as Relevance Models or RM), used in text retrieval to perform pseudo-relevance feedback [9]. In particular, we used the RM2 approach, which showed better performance than RM1 [10].
Both algorithms rely on the notion of a user's neighbourhood to perform their calculations. Intuitively, they decide whether to recommend an item based on the preferences of other users who are considered similar to the active one. We explored four clustering techniques to compute these neighbourhoods with both algorithms. The first one, k-Nearest Neighbours (k-NN), is a well-known technique commonly used with neighbourhood-based algorithms [11]. The second, inverted nearest neighbours (k-iNN), is a modification of k-NN that is claimed to improve both novelty and accuracy [12]. The third is Posterior Probabilistic Clustering [13], in particular the model that uses the K-L divergence cost function (PPC2). Lastly, we used Normalized Cuts (NC), a technique from image segmentation [14], adapted to partition users into clusters. The last two are hard clustering techniques, where a user can belong to only a single cluster. In contrast, the first two are soft clustering techniques, meaning that a user can be in more than one cluster at the same time. The two soft clustering methods also require a similarity measure, which has to be defined independently. For our research, we used cosine similarity in both cases.
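The neighbourhood-based pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes a dense user-item matrix where 0 means "not rated", it assumes that WSR scores an item as the similarity-weighted sum of the neighbours' ratings, and it assumes that the k-iNN neighbourhood of a user contains the users who have her among their own k nearest neighbours. All function names are hypothetical.

```python
import numpy as np

def cosine_similarity(ratings):
    """Pairwise cosine similarity between the rows (users) of a rating matrix."""
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no ratings
    unit = ratings / norms
    return unit @ unit.T

def knn_neighbourhoods(sim, k):
    """Boolean matrix N with N[u, v] True when v is one of the k users most similar to u."""
    n = sim.shape[0]
    ranked = sim.astype(float)  # astype returns a copy, so sim is left untouched
    np.fill_diagonal(ranked, -np.inf)  # a user is never her own neighbour
    top = np.argsort(-ranked, axis=1, kind="stable")[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], top] = True
    return mask

def inverted_neighbourhoods(sim, k):
    """Assumed k-iNN definition: v is a neighbour of u when u is among the k-NN of v."""
    return knn_neighbourhoods(sim, k).T

def wsr_recommend(ratings, sim, neighbours, n):
    """Top-n unrated items per user, scored by a similarity-weighted sum of ratings."""
    scores = (sim * neighbours) @ ratings
    scores[ratings > 0] = -np.inf  # never recommend items the user already rated
    return np.argsort(-scores, axis=1, kind="stable")[:, :n]
```

Because both neighbourhood functions return a mask of the same shape, switching between k-NN and k-iNN only changes the `neighbours` argument passed to `wsr_recommend`.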

Evaluation Protocol
Given the space constraints, we report our results only on the MovieLens 100k dataset, although similar trends have been observed in other collections. This is a very popular public dataset for evaluating collaborative filtering methods. It contains 100,000 ratings that 943 users gave to 1682 items. We used the splits provided with the collection to perform 5-fold cross-validation.
To evaluate the effectiveness of the recommendations we used the Normalized Discounted Cumulative Gain (nDCG), using the standard formulation described in [15] with ratings as graded relevance judgements. In our experiments, only items with a rating of 4.0 or higher are considered relevant when evaluating. To assess the diversity of the recommendations we used the inverse of the Gini index [6]. A value of 0 means that a single item is being recommended to all users, whereas a value of 1 means that all items are recommended equally to all the users. To evaluate novelty we used the mean self-information (MSI) [16]. All the metrics are evaluated at a cut-off of 10, because we are interested in the quality of the top recommendations.
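As a rough sketch of how the provided splits might be read, the snippet below assumes the standard layout of the MovieLens 100k distribution, where the five folds are stored as `u1.base`/`u1.test` through `u5.base`/`u5.test`, each a tab-separated file with user id, item id, rating and timestamp. The function names are hypothetical.

```python
import csv
from pathlib import Path

def read_ratings(path):
    """Read one tab-separated split file into (user, item, rating) tuples."""
    with open(path, newline="") as f:
        return [(int(user), int(item), float(rating))
                for user, item, rating, _timestamp in csv.reader(f, delimiter="\t")]

def load_folds(root):
    """Yield a (train, test) pair for each of the five predefined folds."""
    root = Path(root)
    for n in range(1, 6):
        yield read_ratings(root / f"u{n}.base"), read_ratings(root / f"u{n}.test")
```

Metrics would then be computed on each test split and averaged over the five folds.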
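The three metrics can be sketched as below. The exact formulations in [6], [15] and [16] are not reproduced in this text, so the code relies on assumptions: nDCG uses a linear gain equal to the rating with a log2 discount; the Gini-based diversity is computed as 1 - (1/(N-1)) Σ_k (2k - N - 1) p_k over item recommendation shares p_k sorted in increasing order, which yields 0 when a single item is always recommended and 1 when all items are recommended equally; and MSI sums -log2(|U_i|/|U|) over each user's list and averages over users. All names are hypothetical.

```python
import math
from collections import Counter

def ndcg_at_k(ranked_items, test_ratings, k=10, threshold=4.0):
    """nDCG@k for one user; unrated items and ratings below threshold have zero gain."""
    def gain(item):
        rating = test_ratings.get(item, 0.0)
        return rating if rating >= threshold else 0.0
    dcg = sum(gain(item) / math.log2(pos + 2)
              for pos, item in enumerate(ranked_items[:k]))
    ideal = sorted((gain(item) for item in test_ratings), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def gini_diversity(rec_lists, n_items):
    """Inverse Gini index over how often each catalogue item is recommended."""
    counts = Counter(item for recs in rec_lists for item in recs)
    total = sum(counts.values())
    # Items that are never recommended contribute a share of zero.
    shares = sorted([c / total for c in counts.values()]
                    + [0.0] * (n_items - len(counts)))
    weighted = sum((2 * k - n_items - 1) * p for k, p in enumerate(shares, start=1))
    return 1.0 - weighted / (n_items - 1)

def mean_self_information(rec_lists, item_user_counts, n_users):
    """Average over users of the summed self-information of their recommendations.

    Assumes every recommended item was rated by at least one training user.
    """
    total = sum(-math.log2(item_user_counts[item] / n_users)
                for recs in rec_lists for item in recs)
    return total / len(rec_lists)
```

With a cut-off of 10, each entry of `rec_lists` would hold the ten highest-ranked items for one user.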

Results
We tested all the combinations of recommender and clustering techniques. For the soft clustering methods (k-NN and k-iNN) we varied the number of neighbours between 25 and 200. For the hard clustering techniques (PPC2 and NC) we varied the number of clusters between 10 and 100. The results in terms of accuracy (nDCG), diversity (Gini) and novelty (MSI) can be observed in Figure 1.
When it comes to accuracy alone, both k-NN and k-iNN show superior performance compared to the hard clustering methods, and the two offer similar results in terms of nDCG. For these two methods, the recommender that offers the best results varies: k-NN obtains better results with the RM2 algorithm, whereas k-iNN obtains better results with the WSR algorithm.
Regarding diversity and novelty, it can be observed that, most of the time, tuning a method to provide more accurate results leads to a decrease in these other two measures. This is not always the case, as can be seen with the soft clustering techniques, where increasing the number of neighbours too much leads to decreases in accuracy, diversity and novelty alike. It can also be seen that different algorithms can obtain different levels of diversity and novelty at the same level of accuracy. In this regard, the k-iNN method shows superior levels of diversity and novelty compared to the k-NN technique at similar levels of accuracy, confirming the claim of its proponents.

Discussion
The results show that the intuition that raising accuracy while tuning a recommender leads to decreases in novelty and diversity holds most of the time. However, there are situations where this is no longer true and all the metrics move in the same direction when a parameter is changed.
The results also show that the choice of algorithm is important when it comes to improving the properties of the system. It is possible to improve the performance of the system in diversity and novelty while maintaining similar levels of accuracy. It is also possible to tune the system to balance how well it performs in all the metrics. This is a multi-objective problem and a trade-off must be chosen, either by setting a priori the weight that each measure has, or by choosing one of the parameter combinations on the Pareto front.
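The two selection strategies mentioned above can be sketched as follows. This is an illustrative sketch only: it assumes each evaluated configuration is summarized by a tuple (nDCG, Gini, MSI) where higher is better for every metric, and the function names are hypothetical.

```python
def dominates(a, b):
    """True when a is at least as good as b on every metric and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(results):
    """Keep the configurations that no other configuration dominates.

    results maps a configuration label to its (nDCG, Gini, MSI) tuple.
    """
    return {config: metrics for config, metrics in results.items()
            if not any(dominates(other, metrics) for other in results.values())}

def weighted_choice(results, weights):
    """Pick the configuration that maximizes an a priori weighted sum of metrics."""
    return max(results,
               key=lambda config: sum(w * m for w, m in zip(weights, results[config])))
```

Since MSI is not bounded to [0, 1] like nDCG and the Gini-based measure, the metrics would likely need to be rescaled to a common range before a weighted sum is meaningful.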

Figure 1. Values of nDCG@10, Gini@10 and MSI@10 of all studied algorithms when varying the number of clusters or neighbours.