Aggregation of Rankings Using Metaheuristics in Recommendation Systems

Abstract: Recommendation systems are a powerful tool that is an integral part of a great many websites. Most often, recommendations are presented in the form of a list that is generated by using various recommendation methods. Typically, however, these methods do not generate identical recommendations, and their effectiveness varies between users. In order to solve this problem, the application of aggregation techniques was suggested, the aim of which is to combine several lists into one, which, in theory, should improve the overall quality of the generated recommendations. For this reason, we suggest using the Differential Evolution algorithm, the aim of which will be to aggregate individual lists generated by the recommendation algorithms and to create a single list that will be fine-tuned to the user's preferences. Additionally, based on our previous research, we present suggestions to speed up this process.


Introduction
In today's world, where the amount of available information is overwhelming for the average user, systems designed to support users in making decisions are becoming increasingly important. This role is taken on by recommendation systems, which are used in more and more areas of our lives, from buying items on auction sites, through selecting a movie, to adding new friends on social networks. The growing popularity of this type of website means that there is a real demand for recommendation systems that work efficiently and not only increase the quality of the generated recommendations but also ensure their novelty and diversity [1].
Within recommendation systems, we can distinguish two main approaches to creating a recommendation. They can be based on an attempt to predict what rating (e.g., on a scale from 1 to 5) the user would give to an item in the system. They can also attempt to predict a certain set of items, most often presented in the form of a list, that would be recommended to the user [2] (this problem is also called the Top-N recommendation problem). Additionally, we can rely on data entered directly by the user or we can infer their preferences by observing how they use the system. This article will also discuss the problem of rank aggregation, which has been described thoroughly in the literature, especially in the context of information retrieval systems [3][4][5], and has been proven to be NP-hard [6] even for small collections of rankings (e.g., four or more). However, according to some researchers [7], this topic has not yet been sufficiently studied in the context of recommendation systems. Depending on the dataset used, individual recommendation algorithms can generate different recommendations, and choosing one particular algorithm over the others can decrease the quality of recommendations for some of the users. Therefore, the use of aggregation techniques has also been proposed in this context, where the aim is to combine the individual lists generated by different recommendation techniques in order to create one "super" list.
Additionally, due to the fact that we will be optimizing the average precision (AP) measure, the Differential Evolution (DE) algorithm will be used, which is a metaheuristic that makes the direct optimization of this measure possible [8]. Our method is universal, and thus any metaheuristic algorithm that is used for real-valued optimization can be used here (e.g., PSO [9]). We chose the DE to conduct our research, due to the fact that it is well-suited for this type of optimization [10][11][12]. DE is arguably one of the most versatile and stable population-based search algorithms that exhibits robustness to many different optimization problems [13]. Additionally, it is relatively simple to implement and has a small number of control parameters, which makes this algorithm easy to tune.
The main contribution of this paper is to present how the DE algorithm can be applied to the problem of rank aggregation in recommendation systems, which will be supported by tests performed on the MovieLens 100k data set [14]. We will also present, based on our previous work [15], how to accelerate this algorithm while generating ranking lists of items using a dedicated fitness function. This function can also be successfully used in other metaheuristics that use real-valued representations of individuals in a population. In addition, we will present research that will show that the use of metaheuristic algorithms in the context of the problem of rank aggregation can be additionally justified due to the resistance of these techniques to algorithms that generate low-quality recommendations.
The article is divided into six sections. Section 2 reviews the related literature. Section 3 presents a formal definition of a recommendation system, an explanation of the rank aggregation problem and the Differential Evolution algorithm. Section 4 presents a description of our algorithm along with the system architecture and a figure showing a simple example of how the matrix fitness function is calculated. Section 5 discusses how the test environment was prepared for conducting the experiments and presents the results with commentary. Finally, Section 6 discusses our conclusions and proposals for future research.

Literature Overview
The problem of recommendations can be presented as the problem of predicting how a user would rate a given item (e.g., on a scale from 1 to 5) [16], or as the problem of creating a list of suggested items and is referred to as the Top-N recommendation problem [17]. In fact, the latter is more similar to the real-life scenario when working with recommendation systems [18], where the recommendations are most often presented in the form of a list of suggested items in which the elements at the beginning are more important than the ones at the end.
There have been many works describing this approach in the context of recommendation systems [2,17]. In order to evaluate the quality of such recommended lists, measures that take into account the order in which the items appear on the list are used, e.g., Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). Due to the fact that these measures are usually difficult to directly optimize, metaheuristic algorithms can be applied here [8,19]. A good review of evolutionary algorithms in recommendation systems is the paper [20], in which the authors presented an overview of the current research in this area and suggestions for research in the future.
In this article, we also pay attention to the problem of rank aggregation. A great deal of work has been done on this subject, especially in the context of information retrieval systems [21]. The algorithms used for rank aggregation are generally divided into two categories: permutation-based and score-based. Many techniques have been suggested in the literature, for example: Borda Count [6], COMB* [22] (e.g., COMBSUM and COMBMNZ), or OutRank [23]. Within the context of recommendation systems, there have also been several works addressing this problem. In [24], a system was suggested that creates recommendations for an entire group of users rather than, as is usually done, for a single user.
In the work [25], the authors suggested creating a multi-criteria recommendation system, which, in addition to the quality of the generated recommendations, also took into account measures, such as novelty and diversity. In [26], the authors used genetic programming to create a recommendation system that generated recommendations by optimizing the MAP measure. It is also worth paying attention to [7], in which the researchers asked themselves whether the problem of rank aggregation in the context of recommendation systems is worth looking into. They performed extensive experiments and suggested the direction in which future work in this area should go.

Background of the Research
This section explains the basic information and the definitions used in this article. First, the definition of the recommendation system, the methods of obtaining feedback from users and the problem of matrix factorization will be discussed. Then, we will present the problem of rank aggregation in the context of recommendation systems. Finally, we will present the metaheuristic algorithm that is used in our research.

Recommender System
In a recommendation system, we distinguish a certain set of users $U = \{u_1, \ldots, u_{|U|}\}$ and a certain set of items $I = \{i_1, \ldots, i_{|I|}\}$. Each of the users $u \in U$ has interacted with some of the items $i \in I$. The task of the recommendation system for the Top-N recommendation problem is to predict, on the basis of the historical data collected in the system, the user's next choices and to create a list of items that are likely to interest the user. High-quality recommendations contribute to user satisfaction, which can translate into an overall good impression when using the platform. Depending on what kind of feedback is obtained from the user, recommendation techniques can be based on data from:
• Implicit feedback: This feedback is obtained by analyzing the user's behavior in the system, e.g., clicking on a specific product, page views and adding an item to the basket [27]. This type of feedback is easier to obtain as there is no need to ask the user to interact with the system (e.g., commenting and rating items). The main disadvantage of this approach is the lack of information on whether the interaction with the object was positive or negative [28]. For example, the user may have accidentally added an item to the basket and later removed it, and the mere fact of opening a page does not mean that the user likes the item. For this reason, the implementation of systems based on this type of data is associated with a number of challenges and has been described in many works [29,30].
• Explicit feedback: This feedback is obtained from the user in a direct way; for example, the system asks the user to rate a given item [31]. The main advantage of this type of feedback is that it is easier to determine whether the interaction with the system was positive or negative. For example, if the user can enter a rating on a scale from 1 to 5 and selects a rating of 5, then, with a high probability, it can be assumed that this is an item that the user likes.
• Hybrid feedback: This is a combination of the two previously discussed techniques [32].
It should also be noted that recommendation systems often do not have good quality features for users and items. For this reason, various methods of obtaining them have been proposed, and one of the most popular techniques is to factorize the user-item matrix. With this, we can obtain features that are also called latent features. More on the subject can be found in [33].

Rank Aggregation Problem
This section describes the problem of rank aggregation in the context of recommendation systems. We define a ranking as an ordered list of items $\tau = [i_j \geq i_h \geq \cdots \geq i_z]$, where the items at the beginning of the list (first position) are more significant than those at the end (last position). The position of item $i_j$ in ranking $\tau$ is denoted by $\tau(i_j)$. Two items $i_j \in \tau$ and $i_h \in \tau$ can be compared by checking their positions in the list $\tau$. If the item $i_j$ is ranked higher in $\tau$ than the item $i_h$, we write $\tau(i_j) > \tau(i_h)$.
In recommendation systems, rankings are generated by various algorithms, where a single algorithm is denoted by $a_h$, and a set of n recommendation algorithms is denoted by $A = \{a_1, a_2, \ldots, a_n\}$. Each of the algorithms $a_h \in A$ generates a ranking $\tau$, and the set of all n created rankings is defined as $T = \{\tau_1, \tau_2, \ldots, \tau_n\}$. In addition, all algorithms that generate recommendations take, as input, a matrix $M_{m \times n}$. Each row in this matrix represents a user $u_i \in U$, and each column represents an item $i_j \in I$. The value $M_{i,j}$ corresponds to the rating given by the user $u_i$ to the item $i_j$. Note that users rate only a small fraction of the items appearing in such a matrix; therefore, the matrix is very sparse.
The problem of rank aggregation can be defined as the problem of finding such a combination of rankings in T generated by a set of recommendation algorithms A for each user u i ∈ U, to create a single list ("super-list") that will optimize a given criterion (in our case, the average precision) to the greatest extent. Such a list should, in theory, be "better" than individual lists.

Differential Evolution
In order to optimize the AP measure, the Differential Evolution algorithm was used, which is a metaheuristic developed by K. Price and R. Storn [10]. It is based on individuals, which are represented as vectors of real numbers. For this reason, it is primarily suitable for the optimization of continuous functions, although there are papers that have suggested modifications to the algorithm and its adaptation to the optimization of discrete problems [30].
There is a population P of individuals, where each individual is a solution to an optimization problem, often represented as a d dimensional vector of real-valued numbers. The initial population P can be initialized randomly and should cover the entire search space. In the classic version of the algorithm, this is assumed to have a uniform probability distribution. In order to determine how good a given individual is in the population, it is necessary to define the fitness function, which assigns a certain value to each individual in the population.
This value is later used in the selection process, which is the process of choosing which individuals should go to the next generation. With each iteration, the algorithm attempts to improve the population of individuals until the stopping criterion is reached (e.g., a certain number of iterations). Owing to the use of crossover and mutation operators [34], the population of individuals changes and the algorithm attempts to find a better solution. Mutation creates a new individual by combining three randomly selected individuals and can be expressed with the following formula:

$$v_i = r_1 + F \cdot (r_2 - r_3) \qquad (1)$$

where $r_1$, $r_2$ and $r_3$ are random, mutually distinct individuals ($r_1 \neq r_2 \neq r_3$). The parameter F is the amplification factor and usually takes a value in the range [0, 1].
After creating a new individual $v_i$ using the mutation operator, we apply the crossover operator according to Formula (2). The parameter CR determines the crossover probability. Additionally, there is a rand function that generates a random number in the range [0, 1].
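The mutation, crossover and selection steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function name `differential_evolution`, the binomial crossover with a forced index `j_rand`, the exclusion of the current individual from the three sampled ones, and the greedy "keep the trial vector if it is at least as fit" selection are assumptions based on the common DE/rand/1/bin variant.

```python
import random


def differential_evolution(fitness, dim, np_size=20, f=0.5, cr=0.9,
                           iterations=100, bounds=(0.0, 1.0), seed=0):
    """Maximize `fitness` with a basic DE/rand/1/bin loop (illustrative sketch)."""
    rng = random.Random(seed)
    low, high = bounds
    # Uniform random initialization over the search space.
    pop = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(np_size)]
    scores = [fitness(x) for x in pop]
    for _ in range(iterations):
        new_pop, new_scores = list(pop), list(scores)
        for i in range(np_size):
            # Three mutually distinct individuals, also different from i (assumption).
            r1, r2, r3 = rng.sample([j for j in range(np_size) if j != i], 3)
            # Mutation: v = r1 + F * (r2 - r3), applied component-wise.
            v = [pop[r1][d] + f * (pop[r2][d] - pop[r3][d]) for d in range(dim)]
            # Binomial crossover; j_rand forces at least one component from v (assumption).
            j_rand = rng.randrange(dim)
            u = [v[d] if (rng.random() <= cr or d == j_rand) else pop[i][d]
                 for d in range(dim)]
            # Greedy selection: the trial vector replaces the parent if it is not worse.
            u_score = fitness(u)
            if u_score >= scores[i]:
                new_pop[i], new_scores[i] = u, u_score
        pop, scores = new_pop, new_scores
    best = max(range(np_size), key=lambda i: scores[i])
    return pop[best], scores[best]


if __name__ == "__main__":
    # Toy fitness with its maximum at (0.3, 0.3, 0.3, 0.3); purely illustrative.
    toy = lambda x: -sum((value - 0.3) ** 2 for value in x)
    print(differential_evolution(toy, dim=4))
```

In the rank aggregation setting, `dim` would correspond to the number of aggregated algorithms and `fitness` to the measure being optimized.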

Suggested AggRankDE Method
Our AggRankDE method uses the scores produced by the individual recommendation algorithms for each item i in the set of all items I to find a weight vector W that achieves the largest AP value on the training set TS. It should be noted that this vector is created separately for each user u i ∈ U, since each user has their own individual recommendation preferences. Additionally, based on our previous research, we suggest a matrix representation for the scores given by the individual algorithms and for the population of individuals of the DE algorithm.
Details of this representation can be found in our previous work [15], and a simple example is presented in Figure 1. With this representation, it is easier to parallelize the process of learning user preferences and, thus, to reduce the computation time needed to find the particular preference vector W.

Figure 1. Toy example of the multiplication of two matrices. Matrix A represents scores assigned by the recommendation algorithms to each item i ∈ I, and matrix B represents some population P (real-valued vectors) of the metaheuristic algorithm. The matrix product C represents new scores for each item i ∈ I, which, after sorting, create new rankings τ n where n ∈ {1, 2, . . . , NP}.
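The idea behind Figure 1 can be illustrated with a short sketch: one matrix product yields the aggregated scores of every item for every individual in the population, and sorting each column gives one ranking per individual. The helper names `matmul` and `rankings_from_population` are hypothetical, and the plain-Python product is used only to keep the example dependency-free; in practice a vectorized library (for example, NumPy's `@` operator) would take its place, which is where the parallel speed-up comes from.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (r x k) and b (k x c)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def rankings_from_population(scores, population):
    """Build one ranking per individual from a single matrix product.

    scores:     |I| x n matrix A (one row per item, one column per algorithm).
    population: NP weight vectors of length n (the DE individuals).
    Returns a list of NP rankings; each ranking lists item indices,
    best item first.
    """
    # Matrix B stores one individual per column (n x NP).
    b = [list(column) for column in zip(*population)]
    # Matrix C (|I| x NP): aggregated score of every item for every individual.
    c = matmul(scores, b)
    rankings = []
    for p in range(len(population)):
        column = [row[p] for row in c]
        # Sort item indices by descending aggregated score.
        rankings.append(sorted(range(len(column)), key=lambda i: -column[i]))
    return rankings


if __name__ == "__main__":
    # Three items scored by two algorithms; two individuals (toy values).
    a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    population = [[1.0, 0.0], [0.0, 1.0]]
    print(rankings_from_population(a, population))
```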
The hybridization technique was taken from [25] and is based on assigning weights $W = \{w_{a_1}, w_{a_2}, \ldots, w_{a_n}\}$, one for each algorithm $a_h$ from the set of algorithms $A = \{a_1, a_2, \ldots, a_n\}$. The aggregated value for each item is calculated according to the formula:

$$\hat{p}(i_j \mid u_i) = \sum_{h=1}^{n} w_{a_h} \cdot \hat{p}_{a_h}(i_j \mid u_i) \qquad (3)$$

where $w_{a_h}$ is the weight assigned to the algorithm $a_h \in A$, with each algorithm assigning a value of $\hat{p}_{a_h}(i_j \mid u_i)$ to each item $i_j$, which determines the degree of potential interest of user $u_i$ in this item. The scores should also be normalized so that all the algorithms in A operate on the same scale.
The use of an evolutionary metaheuristic requires a fitness function so that, in subsequent iterations, the algorithm can reward individuals who are better adapted, i.e., who have a greater value of the fitness function. In our case, this is the average precision (AP) measure calculated for the active user $u_A$, where S is the set of items recommended by the system and R is the set of items that user $u_A$ rated in TS. According to our experiments, the value of k in AP during the learning process should be set to the number of items that the user $u_A$ rated in their TS. In our opinion, such a value is most appropriate because it does not cause the algorithm to overfit. The details of how to calculate AP, especially in the context of recommendation systems, can be found in our paper [35]. The architecture of our system is presented in Figure 2.

Figure 2. System architecture. The recommendation process is divided into two phases. In the first phase, recommendation algorithms generate recommendations in the form of lists, and the active user $u_A$ is selected with all of their N items from the training set. In the second phase, a metaheuristic algorithm (in our case DE) works with the dedicated fitness function, which allows for faster calculation of item scores, on the basis of which new rankings are created.
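To make the fitness evaluation of a single weight vector more concrete, the following sketch normalizes the scores of each algorithm, aggregates them with a weighted sum, sorts the items and computes AP at a cutoff k. Several details are assumptions rather than the authors' definitions: min-max normalization is only one possible normalization technique, the AP variant shown here divides by min(k, |R|) (the exact definition is the one given in [35]), and the function names are hypothetical.

```python
def min_max_normalize(values):
    """Rescale scores to [0, 1]; one possible normalization (assumption)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def average_precision(ranking, relevant, k):
    """AP@k with normalization by min(k, |R|) (assumed variant)."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranking[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / position  # precision at this position
    denominator = min(k, len(relevant))
    return total / denominator if denominator else 0.0


def fitness(weights, scores_per_algorithm, relevant, k):
    """Fitness of one weight vector W for the active user.

    scores_per_algorithm: n lists, each holding one score per item.
    relevant:             set R of item indices the user rated in TS.
    """
    normalized = [min_max_normalize(scores) for scores in scores_per_algorithm]
    n_items = len(normalized[0])
    # Weighted sum of the normalized scores for every item.
    aggregated = [sum(w * scores[i] for w, scores in zip(weights, normalized))
                  for i in range(n_items)]
    # Ranking: item indices sorted by descending aggregated score.
    ranking = sorted(range(n_items), key=lambda i: -aggregated[i])
    return average_precision(ranking, relevant, k)


if __name__ == "__main__":
    scores = [[3.0, 2.0, 1.0], [1.0, 2.0, 3.0]]  # two algorithms, three items
    print(fitness([0.2, 0.8], scores, relevant={2}, k=1))
```

During training, k would be set to the number of items the active user rated in TS, as described above.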

Experimental Evaluation
Due to the fact that recommendations are most often presented to users in the form of a list, in our experiments, we used the average precision measure (AP) and the mean average precision measure (MAP). The AP measure is used in the context of a specific (one) user, and, in our research, it was used to compare the list of items recommended to the user with the list of items available in the test set for a given user. This allowed us to calculate the quality of the generated recommendations.
In addition, this measure takes into account where the relevant items are located on the list: the closer the relevant items are to the first position, the higher the AP value. Because metaheuristics are computationally expensive, we chose only a subset of users for the experiments. We randomly selected 50 users who had rated at least 150 movies in the dataset. The experiments carried out as part of this paper were performed using the popular MovieLens 100k dataset. The AggRankDE algorithm takes four algorithms as input: SVD, WMF, BPR and WARP. All of them are based on matrix factorization, and thus features are generated for each item and for each user on the basis of the user-item matrix.
These features are called latent features because their meaning cannot be directly interpreted. In addition, these algorithms are considered to be the current state-of-the-art and are often used to compare research results in recommendation systems for the Top-N recommendation problem. The research environment was implemented in Python and C#, and the research was carried out on a computer with an Intel Core i5-7600 (3.50 GHz) with 16 GB RAM.

Parameters Tuning
Before creating an aggregation, the parameters of the algorithms that are included must be tuned. To this end, experiments were conducted to tune their values so that they could achieve the best possible MAP measure on the set of users used for the experiments. This is an important step, due to the fact that improper tuning of the parameters can result in the generation of poor quality recommendations. Table 1, presented below, shows the parameter values used during the tuning process.
This process consisted of first setting all parameters to their default values and then changing only the one parameter that was selected for tuning. After the process was completed, the best values were saved (the "Best values" column in Table 1). The detailed MAP@10 values obtained during this process for various parameters are presented in Table 2 (learning rate), Table 3 (regularization) and Table 4 (latent features).
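The one-parameter-at-a-time procedure described above can be sketched as follows. The function `tune_one_at_a_time`, the parameter names and the candidate values are hypothetical placeholders; `evaluate` stands for any routine that trains a model with the given parameters and returns a quality score such as MAP@10.

```python
def tune_one_at_a_time(defaults, candidates, evaluate):
    """Tune each parameter separately while the others stay at their defaults.

    defaults:   dict mapping parameter name -> default value.
    candidates: dict mapping parameter name -> list of values to try.
    evaluate:   callable(dict) -> score, where higher is better.
    Returns a dict mapping parameter name -> best value found.
    """
    best_values = {}
    for name, values in candidates.items():
        best_score, best_value = None, None
        for value in values:
            params = dict(defaults)  # all other parameters keep their defaults
            params[name] = value
            score = evaluate(params)
            if best_score is None or score > best_score:
                best_score, best_value = score, value
        best_values[name] = best_value
    return best_values


if __name__ == "__main__":
    # Toy stand-in for "train the model and measure MAP@10" (illustrative only).
    toy_evaluate = lambda p: -abs(p["learning_rate"] - 0.1) + p["latent_features"] / 100
    defaults = {"learning_rate": 0.05, "latent_features": 10}
    candidates = {"learning_rate": [0.01, 0.05, 0.1], "latent_features": [10, 50]}
    print(tune_one_at_a_time(defaults, candidates, toy_evaluate))
```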
The process of tuning the CR and F parameters for the DE algorithm was also performed, and the results of these experiments are presented in Tables 5 and 6. In addition, in [10], the authors indicated that a good value for the parameter NP is between $5 \cdot d$ and $10 \cdot d$, where d is the number of dimensions. The authors also point out that F = 0.5 is usually a good initial value and that this parameter typically takes a value in the range [0.4, 1]. The final parameter values of the Differential Evolution algorithm that were used during the experiments are presented in Table 7.

Table 7. The differential evolution parameters used in the experiments.

Parameter Name             Value
Population                 50
Number of Iterations       500
Crossover Probability      0.9
Amplification Factor F     0.5

Experimental Setup
In order to prepare the environment for testing, the data was first prepared in an appropriate way. User ratings were sorted by the time at which a given rating was issued and then divided into two sets: training (80%) and test (20%). Owing to this approach, our algorithm attempts to predict the user's future preferences based on the user's previous activity. The task is not trivial because of the large number of items from which the items presented to the user must be chosen.
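A time-ordered split of this kind can be sketched as follows. The tuple layout `(item_id, rating, timestamp)` and the choice to split each user's ratings separately are assumptions made for illustration; the function name `chronological_split` is hypothetical.

```python
def chronological_split(ratings, train_fraction=0.8):
    """Split one user's ratings into training and test sets by time.

    ratings: list of (item_id, rating, timestamp) tuples (assumed layout).
    The oldest `train_fraction` of the ratings form the training set;
    the most recent ones form the test set.
    """
    ordered = sorted(ratings, key=lambda r: r[2])  # oldest first
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    # Ten toy ratings with shuffled timestamps (illustrative values).
    toy = [(item, 4, ts) for item, ts in enumerate([50, 10, 90, 30, 70, 20, 100, 40, 80, 60])]
    train, test = chronological_split(toy)
    print(len(train), len(test))
```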
Fifty users were randomly selected for the study, where a recommendation was generated for each user, and then the results of the suggested recommendations were compared with the test sets of each user. The AP measure was used to calculate the quality of the generated recommendations, and then its value was averaged for all users selected for testing; thus, the tables show the results given using the MAP measure. In order to show that our algorithm gives good results, we compared it with other algorithms used for the rank aggregation problem, such as the Borda Count, Majority Judgement, Pairwise Method (Copeland's) and Score Voting (mean).
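For orientation, the simplest of the baseline aggregators named above, the Borda Count, and the averaging of per-user AP values into MAP can be sketched as follows. This is a generic textbook formulation, not necessarily the exact variant used in the experiments: the point scheme (an item at zero-based position `pos` in a list of length `n` receives `n - pos` points) and the tie-breaking by item identifier are assumptions.

```python
def borda_count(rankings):
    """Aggregate several rankings with a basic Borda Count.

    rankings: list of rankings, each a list of item ids, best item first.
    An item at zero-based position pos in a ranking of length n receives
    n - pos points; items are then sorted by their total points.
    """
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            points[item] = points.get(item, 0) + (n - pos)
    # Ties are broken by item id to keep the result deterministic (assumption).
    return sorted(points, key=lambda item: (-points[item], item))


def mean_average_precision(ap_values):
    """MAP: the mean of the per-user AP values."""
    return sum(ap_values) / len(ap_values) if ap_values else 0.0


if __name__ == "__main__":
    lists = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
    print(borda_count(lists))
    print(mean_average_precision([1.0, 0.5]))
```

Unlike AggRankDE, such a method has no training phase and treats every input list equally.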
In the research, we additionally took into account the quality of the recommendations achieved by the algorithms that participated in the creation of the aggregation. These included the Bayesian Personalized Ranking (BPR) and Weighted Approximate-Rank Pairwise (WARP) algorithms, the implementation of which is available in the LightFM library [36]. In addition, the usual SVD algorithm, marked in the results as "SVD", and a weighted matrix factorization (WMF) algorithm were implemented.

Results
In Section 5.1, we presented the process of tuning the parameters for the various algorithms used to create aggregations. This is an important step, due to the fact that the quality of the generated recommendations by the different recommendation techniques can largely depend on the parameters that are set. For example, by analyzing Table 4, it can be seen that the MAP value obtained was highly dependent on the number of latent features. Additionally, the research presented in Table 2 showed that the parameter "Learning rate", which is characteristic for the BPR and WARP techniques, also required tuning as opposed to the parameter "Regularization" (Table 3) where the default value (0) generated the best quality of the recommendations.
While analyzing the results presented in Table 8, it can be seen that the AggRankDE algorithm aggregated the recommendation algorithms and improved the overall quality of the generated recommendations, even compared to other aggregation techniques. This is an important observation because it shows that one "super" list can be created from several lists to improve the quality of recommendations, which is consistent with the experimental results reported in [7].
Looking at the quality of the recommendations generated by the different recommendation algorithms, we can see that it varies with the cutoff k used in MAP@k. In general, as the number of items on which the MAP@k measure is calculated increases, the quality of the recommendations decreases, although the AggRankDE algorithm improved the quality of the generated recommendations in all cases.
Additionally, after the introduction of the "Random" method (Table 9), which purposefully generated poor-quality recommendations, the quality of the aggregation produced by AggRankDE did not degrade significantly, in contrast with, for example, the Borda Count method. This indicates that AggRankDE has some resistance to weak algorithms that are used in the aggregation.
Table 10 presents the improvement in the speed (in seconds) of generating recommendations after implementing the matrix fitness function. Time is measured for a single user in the system and depends on the number of iterations. Looking at this table, it can be seen that the improvement in speed is significant, and this is due to the fact that operations on entire matrices can be easily parallelized. This is particularly important in the context of metaheuristic algorithms because computing the fitness function is the most costly step in this type of algorithm.
When analyzing the experimental results, the application of the DE algorithm with the hybridization technique presented in [25] produced good results. However, in our paper, we suggested how to improve it by using a dedicated fitness function to directly optimize the average precision measure and to speed up its calculation process. By assigning different weights to the different algorithms included in the aggregation, the DE algorithm optimizes the average precision measure using a weighted hybridization technique in order to obtain the highest possible value of the average precision measure on the training set.
During the testing phase, this translated into an increase in the quality of the generated recommendations. However, this process is computationally very expensive; therefore, we suggested using the matrix representation in the fitness function, which significantly accelerated the process of calculating the values for each item by the hybridization technique on the basis of which the ranking was created.

Conclusions
In this article, we presented how the Differential Evolution algorithm can be used to solve the problem of rank aggregation in recommendation systems. The experiments were conducted on the MovieLens 100k dataset, and they showed that our algorithm improved the quality of the recommendations expressed by the MAP measure by 5% compared to other algorithms used for this purpose. Our research showed that, even using simple aggregation techniques, we could improve the quality of the generated recommendations.
In addition, in analyzing the research results, it can be seen that the AggRankDE algorithm is resistant to algorithms that generate poor-quality recommendations. We believe that this is due to the fact that, through the presence of a training phase in which the DE algorithm optimizes the AP measure, it is able to detect algorithms that generate low-quality recommendations and assign them correspondingly low weights, which results in them participating least in the creation of the list of recommended items.
Based on our previous work, we also suggested the use of matrix representation for the population of the DE algorithm and the values of coefficients calculated by individual aggregation algorithms for each item in the system. Such a representation makes it much easier to parallelize the process of calculating the values for individual items in the training phase on the basis of which new rankings (recommendations) are created. The calculation of the fitness function is the most expensive operation in the metaheuristic algorithms. In the context of the recommendation systems, this is particularly important, due to the relatively large data sets that are processed.
In future work, we will increase the number of algorithms that are part of the aggregation, add more aggregation techniques and increase the number of datasets on which the research is carried out. We will also conduct a more detailed analysis of the effectiveness of our algorithm, taking into account a larger number of users, and a more detailed analysis of how the parameters of the individual algorithms included in the aggregation and of the model itself affect the quality of the generated recommendations.
Another interesting direction of research would be to take a closer look at the quality of the generated recommendations by particular algorithms in relation to individual users. Although the AggRankDE algorithm is more robust to algorithms that generate poor recommendations, the decrease in the quality is noticeable. Presumably, eliminating the weaker quality algorithms would generally improve the quality of the aggregation produced. We believe that the problem of rank aggregation within the context of the recommendation systems has not yet been sufficiently studied, and this will likely be the direction of our future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

$u$            Generic user
$u_i$          Specific user
$u_A$          Active user in system for which recommendations are generated
$U$            The set of all users
$i$            Generic item
$i_j$          Specific item
$I$            Set of all items
$a_h$          Specific recommendation algorithm
$A$            Set of n recommendation algorithms $A = \{a_1, a_2, \ldots, a_n\}$
$\tau$         Generic ranking
$\tau^{r}_{i}$ Ranking recommended to user $u_i$ by algorithm $a_r$ where $r \in \{1, 2, \ldots, n\}$
$\tau(i_j)$    The position of item $i_j$ in ranking $\tau$
$T$            Set of n rankings $T = \{\tau_1, \tau_2, \ldots, \tau_n\}$
$w_{a_h}$      Weight assigned to recommendation algorithm $a_h$ where $h \in \{1, 2, \ldots, n\}$
$W$            Set of n weights $W = \{w_{a_1}, w_{a_2}, \ldots, w_{a_n}\}$
$R$            Set of items that user $u_A$ rated in their training set
$S$            Set of items recommended to user $u_A$
$P$            Population of metaheuristic algorithm
NP             Number of individuals in population
TS             Training set