A Comparative Study of Rank Aggregation Methods in Recommendation Systems

The aim of a recommender system is to suggest products or services that are most likely to interest the user. Within the context of personalized recommender systems, a number of algorithms have been proposed to generate a ranking of items tailored to individual user preferences. However, these algorithms do not generate identical recommendations, and for this reason it has been suggested in the literature that their results can be combined using aggregation techniques, in the hope that this will translate into an improvement in the quality of the final recommendation. To determine which of these techniques increases the quality of recommendations to the greatest extent, the authors of this publication conducted experiments covering five recommendation algorithms and 20 aggregation methods. The research was carried out on the popular and publicly available MovieLens 100k and MovieLens 1M datasets, and the results were confirmed by statistical tests.


Introduction
With the number of products and services available today, making favorable consumer decisions is ever more difficult. Users, trying to cope with this, often read reviews and comments available online, hoping that they will help them make the right choice. Unfortunately, acquiring relevant information from such a large amount of data is often very difficult and time-consuming [1]. In addition, there are also issues related to the validity and reliability of the very data from which conclusions are drawn [2].
In order to solve this problem, recommender systems have been proposed that assist the user in the decision-making process by suggesting products or services that are most likely to be of interest to them [3]. Over the years, they have grown greatly in popularity and are now often an integral part of social media platforms and auction sites. While the idea behind these systems is relatively simple, implementing them can be complicated, as they often integrate data that comes from various sources [4].
In the literature, the problem of recommendation is presented either as a problem of predicting the rating that a user would give to a given item [5], or as a problem of predicting the ranking of items that would be suggested to the user [6]. Undoubtedly, the second approach is closer to the actual application of recommender systems, where the results are usually presented to the user in the form of a ranking [7]. However, it should be noted that users are more likely to select content in the first positions of such a ranking than in the last positions. For this reason, dedicated measures are used to evaluate each of these approaches and to determine the accuracy of the generated recommendations [8]. It is also worth mentioning that recommendation accuracy is not the only criterion that can be used to evaluate the effectiveness of such systems; other measures proposed in the literature include novelty and diversity [9].
Over the past few years, a number of algorithms have been proposed that are designed to generate recommendations in the form of a ranking (so-called "Top-N recommendation algorithms") [10]. Despite years of research, no universal algorithm has been proposed that generates high-quality recommendations in all cases. In addition, when we compare the generated recommendations in the context of a particular user, these algorithms do not generate identical recommendations. For this reason, the literature suggests the use of aggregation methods, whose task is to aggregate the rankings generated by individual recommendation algorithms in order to create a new, "better" recommendation.
For example, in the book [11] (p. 417), the author indicated that this is a problem that has not yet been sufficiently studied in the context of recommender systems and is an interesting direction for future research. Similar conclusions were presented in the publication [12], where the authors pointed out that relatively few dedicated algorithms addressing this problem have been developed in the literature on recommender systems. Therefore, the authors of this publication recognize a research gap in this area, related to the study of rank aggregation methods in the context of recommendation systems.
The main contribution of this paper is an investigation of which classical aggregation methods, based on supervised and unsupervised learning, produce the best aggregation. To allow easy reproduction of the experimental results, popular datasets were used for the experiments. Although there are publications in the literature that have already analyzed this problem in the context of recommender systems [12], in this paper experiments are conducted on a larger number of classical aggregation methods. This is important because these techniques are often used as baselines and presented alongside the results of new algorithms proposed by researchers. This paper aims to help researchers decide which aggregation methods are worth considering when reporting the results of their experiments in the context of recommender systems.
In addition, the authors are also aware of the problem of reproducibility of experimental results that are reported in scientific publications. In the context of recommender systems, this problem has already been pointed out many times [13][14][15][16], emphasizing that due to the complexity of recommender systems and the different methods of their evaluation, reproducing the results of experiments without access to the source code is often very difficult and sometimes even impossible. With this in mind, the authors of this publication provide the research environment, which was created for the purpose of conducting experiments. It was implemented in the Python programming language, based on publicly available programming libraries.
The article is divided into six sections. Section 2 presents a literature review, referring mainly to the problem of rank aggregation in recommender systems. Section 3 presents a formal definition of the recommendation system and the rank aggregation problem. Section 4 presents details of the research environment used and of the parameter tuning process; in addition, the metrics for evaluating the quality of the generated recommendations and the evaluation protocol are discussed there. In Section 5, the results of the experiments are presented with appropriate commentary. Section 6 is dedicated to conclusions and suggestions for future work.

Related Works
The problem of rank aggregation is well known, especially in the context of social choice theory, which deals with the analysis of collective decision-making and with how to transform the preferences of individual users into the preferences of a group [17]. In the context of modern information filtering systems, the problem of rank aggregation was described by C. Dwork in [18], where the author presented its theoretical basis, analyzing it through the prism of information retrieval systems. In the following years, applications of this idea were proposed in other areas of science, including combining microarray data [19], similarity search and object classification [20], and biology [21].
Within the context of recommender systems, there has been relatively little work related to this problem, and as noted in [11] (p. 417), it is a relatively under-researched field. However, it is hard to say when the idea was first used in recommender systems, since it is not always explicitly framed as a "rank aggregation problem" in scientific publications. It seems that the first papers using this concept were works related to hybrid systems [22,23].
Rank aggregation is primarily used in the generation of group recommendations. Group recommendations, unlike their classical counterparts, are tailored to the preferences of an entire group of users, and not just to one specific user [24]. One such system is [25], where the authors presented interesting results suggesting that, for some users, group recommendations may prove better than personalized recommendations. In the publication [26], the authors proposed an aggregation algorithm based on the Borda method. In turn, in the paper [27], the authors suggested using entropy to analyze the distribution of ratings and to detect items on which group members did not reach consensus.
Rank aggregation is computationally expensive, and it has been proven that, from a certain number of rankings, finding an optimal aggregation becomes NP-hard [18,28]. Therefore, an interesting direction of research is the use of metaheuristic algorithms, which allow an approximate solution to be found in acceptable time. For example, in the publication [29], the authors proposed a hybridization technique that combines recommendations generated by different recommendation algorithms, using an evolutionary algorithm for multi-criteria optimization. The publication [30] suggested the Evolutionary Rank Aggregation (ERA) algorithm, which used genetic programming to directly optimize the MAP measure. The authors tested the suggested solution on four datasets, and the results clearly indicate that the technique improved the quality of the generated recommendations. In another publication [31], the authors proposed the Multi-objective Evolutionary Rank Aggregation (MERA) algorithm, which performs multi-criteria optimization. The publication [32] suggested using the Differential Evolution algorithm to directly optimize the AP measure for individual users in the system. This approach made it possible to find a vector determining the preference of a given user over the individual rankings. However, the main disadvantage of techniques based on metaheuristic algorithms is that they are often difficult to implement correctly and require appropriate tuning.
Particularly noteworthy is the publication [12], in which the authors tried to answer the question of whether the use of rank aggregation methods in recommender systems can be effective. To this end, they conducted a systematic study in which they considered as many as 15 recommendation algorithms and 19 aggregation methods, with experiments carried out on seven different datasets. Analyzing the results of the study, the authors found that aggregation techniques improved the quality of recommendations on six of the seven tested datasets.

Background of the Research
This chapter presents the basic concepts related to the subject of this article. First, basic information on recommender systems is discussed, and then the problem of rank aggregation within the context of these systems is presented.

Recommender Systems
The task of a recommender system is to predict the future preferences of users on the basis of historical data. Nowadays, such systems are increasingly used in various areas of our lives, from buying items on auction websites, through choosing the next movie to watch, to adding new friends on social media. However, this is not a trivial problem, and intensive research has been carried out on the subject for many years [33]. The most important event that significantly increased interest in this problem was the competition organized by Netflix, which offered a prize of 1 million dollars to researchers who managed to sufficiently increase the quality of the generated recommendations [34].
In recommender systems, we can distinguish two main approaches to generating recommendations. They can be based on an attempt to predict what rating (e.g., on a scale of 1 to 5) a user would give to a given item in the system [35]. They may also try to predict a certain set of items, most often presented as an ordered list, that would be recommended to the user [6].
Recommender systems can also be divided into personalized and non-personalized. A non-personalized recommender system is one that tries to draw conclusions from the global behavior of all users in the system, for example, recommending the most-watched movies. Nowadays, however, mostly personalized systems are used, which, based on the historical activity of a given user, create a profile of that user that is then used to generate recommendations [36].
Formally, in a recommender system we distinguish a set of users U = {u_1, ..., u_|U|} and a set of items I = {i_1, ..., i_|I|}. A single interaction can be represented as a triple (u, i, r_ui), which means that a given user u ∈ U interacted with an item i ∈ I, giving it a rating r_ui. All ratings given by users to items are recorded in a user-item interaction matrix R.
In the context of recommender systems, many techniques and methods have been proposed, and for this reason, the literature has suggested dividing them into the following approaches: content-based filtering [37], collaborative filtering [38], knowledge-based filtering [39] and a combination of different techniques, the hybrid approach [40]. One of the most popular techniques for generating recommendations is matrix factorization [41], which decomposes the matrix R into two smaller matrices according to the following formula:

R ≈ P Q^T,

where P is a |U| × k matrix representing user features and Q is an |I| × k matrix representing item features. Then, in order to determine the preference of user u for item i, the dot product of the corresponding feature vectors is computed:

r̂_ui = p_u · q_i,

where p_u is the feature vector for user u, and q_i is the feature vector for item i. This technique allows users and items to be represented by a small number of latent features, and it became very popular due to Simon Funk [42], who used it in the Netflix competition. An example of such a factorization is presented in Figure 1, and Table 1 lists the recommendation algorithms used in the experimental phase.
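The decomposition above can be sketched with NumPy. This is a minimal illustration using random feature matrices; in practice (e.g., in LensKit's matrix-factorization models), P and Q are learned by optimization rather than sampled:

```python
import numpy as np

# Minimal sketch of the factorization R ≈ P Q^T: each user and item is
# represented by k latent features (random here, learned in practice).
rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2

P = rng.normal(size=(n_users, k))   # user-feature matrix, |U| x k
Q = rng.normal(size=(n_items, k))   # item-feature matrix, |I| x k

R_hat = P @ Q.T                     # approximated rating matrix, |U| x |I|

# The predicted preference of user u for item i is the dot product
# p_u · q_i of the corresponding feature vectors.
u, i = 1, 3
prediction = P[u] @ Q[i]
print(np.isclose(prediction, R_hat[u, i]))  # True
```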

Rank Aggregation
The rank aggregation problem refers to the situation where, having several rankings, which are ordered lists consisting of certain objects (e.g., items), our task is to create a new ranking that is "better" than the base rankings.
Formally, this problem can be presented as follows. Let us assume that we have a certain set of elements I = {i_1, i_2, ..., i_m}. We define a ranking as an ordered list of these elements τ = [i_j ≥ i_h ≥ ... ≥ i_z], where ≥ denotes the order relation between the elements of the set I, and the relevance of an element is determined by its position. The symbol τ(i_j) denotes the position (or rank) of item i_j in the ranking τ. Two items i_j and i_h can be compared using their positions in ranking τ; for example, we can say that item i_j is in a "better" position than item i_h, which is denoted as τ(i_j) < τ(i_h). In addition, a single algorithm is denoted as a_h, and the set of all algorithms as A = {a_1, a_2, ..., a_n}. Each algorithm generates a ranking τ, and the set of all rankings is denoted as T = {τ_1, τ_2, ..., τ_n}, where n is both the number of algorithms and the number of generated rankings.
The goal of rank aggregation is to create a new ranking τ * , which in theory should be better than the individual rankings in the set T. The quality of a ranking should be considered in the context of a given problem, keeping in mind its specifics. For example, in recommender systems, this could mean the ranking that most improves the quality of the recommendation, where this quality can be calculated based on the measures described in Section 4.3. For unsupervised methods, however, dedicated distance measures are more often used to determine the degree of similarity between rankings (e.g., Kendall Tau distance [46]).
Therefore, the problem of rank aggregation boils down to defining an aggregate function Ψ that, based on the rankings in the set T, generates a new ranking τ*:

τ* = Ψ(τ_1, τ_2, ..., τ_n).

Depending on the available data, the aggregate function Ψ can be created using different methods. In the literature, the basic division is into score-based and permutation-based methods. In score-based methods, each element in a ranking is assigned a certain score, which determines its position in the ranking; aggregation methods then create a new ranking τ* by combining the scores from the base rankings. Permutation-based rank aggregation methods, on the other hand, create an aggregation by searching the space of possible permutations of elements from the set I.
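As an illustration of a score-based aggregate function Ψ, the following sketch implements a CombSUM-style aggregation (one of the methods used later in the experiments): each item's min-max-normalized scores are summed across the base rankings, and the combined score determines the final order. The three score lists below are hypothetical:

```python
def combsum(score_lists):
    """Score-based rank aggregation: sum each item's (min-max normalized)
    scores across the base rankings and sort by the combined score."""
    combined = {}
    for scores in score_lists:
        lo, hi = min(scores.values()), max(scores.values())
        for item, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0
            combined[item] = combined.get(item, 0.0) + norm
    # tau*: items ordered by descending combined score
    return sorted(combined, key=combined.get, reverse=True)

# Three base rankings expressed as item -> score mappings (hypothetical data).
runs = [
    {"a": 3.0, "b": 2.0, "c": 1.0},
    {"a": 0.9, "c": 0.8, "b": 0.1},
    {"b": 10.0, "a": 9.0, "c": 1.0},
]
print(combsum(runs))  # → ['a', 'b', 'c']
```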
Aggregation methods can also be divided based on the type of learning algorithm used. Methods based on supervised learning [47] create a ranking model using a training set. More advanced techniques may also use an approach called "learning to rank" [48]; they are much more complex and difficult to implement, although they can obtain better results than other methods [12]. In techniques based on unsupervised learning, aggregation is most often created based on dedicated distance measures that allow individual rankings to be compared with each other (e.g., the Kendall Tau distance) [46]. These methods are characterized by simplicity of implementation and by the fact that they do not need a training phase to operate.
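The Kendall Tau distance mentioned above counts the item pairs on which two rankings disagree; a minimal pairwise implementation (not the optimized versions found in libraries) might look as follows:

```python
from itertools import combinations

def kendall_tau_distance(tau1, tau2):
    """Number of item pairs ordered differently by the two rankings.
    Both rankings are lists over the same item set."""
    pos1 = {item: r for r, item in enumerate(tau1)}
    pos2 = {item: r for r, item in enumerate(tau2)}
    return sum(
        1
        for x, y in combinations(tau1, 2)
        if (pos1[x] - pos1[y]) * (pos2[x] - pos2[y]) < 0
    )

# Identical rankings have distance 0; a full reversal maximizes it.
print(kendall_tau_distance(["a", "b", "c"], ["a", "b", "c"]))  # 0
print(kendall_tau_distance(["a", "b", "c"], ["c", "b", "a"]))  # 3
```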
Over the years, a number of techniques have been suggested in the literature for creating a rank aggregation; an overview of them is presented in [12,49]. Figure 2 shows the aggregation process for four rankings generated by four recommendation algorithms, and Table 2 summarizes the aggregation methods used in the experimental phase. Each recommendation algorithm generates a recommendation by assigning a certain score to each item in the system. An aggregation method then combines the rankings to form the final recommendation τ*.

Experimental Evaluation
This chapter discusses the details of the process of conducting the research. First, the experimental setting used to conduct the experiments will be presented. Then, the process of tuning the parameters of the recommendation algorithms is discussed, and the measures used to evaluate the quality of the generated recommendations are presented. Finally, the evaluation protocol is discussed.

Experimental Setup
To make the research possible, a dedicated research environment, RecRankAgg, had to be prepared, since none of the existing solutions met the necessary requirements. To reduce implementation time and the chance of errors, existing programming libraries that already provided some of the needed functionality were used. The recommender system was created based on LensKit [43], a set of tools designed for research work on recommender systems. Its functionality includes loading datasets, dividing them into training and test sets, and evaluating the generated recommendations with various quality measures. In addition, this library provides implementations of the recommendation algorithms presented in Table 1. The aggregation methods presented in Table 2 and used in the experimental phase were available in the Ranx library [60,61]. The RecRankAgg experimental environment was implemented in Python, and the experiments were conducted on an Intel Core i5-7600 (3.50 GHz) computer with 16 GB RAM.
The research was conducted on two popular datasets, MovieLens 100k and MovieLens 1M [62]. The number in the name of each dataset indicates the number of available ratings. The choice of the smaller versions of the datasets was motivated by the fact that as the number of available ratings increases, so does the time required to train the various recommendation models, which also lengthens the process of tuning hyperparameters. These are popular and publicly available datasets, thanks to which the results of the experiments can be easily reproduced. The MovieLens 100k dataset contains 100,000 ratings given by 943 users for 1682 movies. Each user in this dataset rated at least 20 movies, on a scale of 1 to 5. By contrast, the MovieLens 1M dataset contains 1,000,209 ratings, given by 6040 users for 3952 movies. As with the smaller version, it had been cleaned up beforehand, and users who rated fewer than 20 movies were removed from it. In addition, all ratings in these datasets have a timestamp.

Parameters Tuning
Before generating recommendations, the parameters of the recommendation algorithms must be properly tuned. The goal of this process is to find a set of parameters that maximizes the quality of the generated recommendations (expressed using the MAP measure), using a training set (60%) and a validation set (20%).
The software used to tune the parameters was the Optuna library [63], which allows this process to be automated. The tuning algorithm was the Tree-structured Parzen Estimator [64], which creates a probabilistic model based on the history of previous hyperparameter values and then uses it to suggest subsequent values. To keep the tuning process from being too long, a limit of 100 trials was set in advance. Table 3 presents the parameters of the recommendation algorithms, along with their type, range of values, and the best value found during the tuning process. The names of the tuned parameters are consistent with the parameter names available in the LensKit library. The results of the tuning process are presented as graphs in Appendix A and are discussed below. Figure A1 shows the process of tuning the min_nbrs and nnbrs parameters of the Item kNN algorithm. It can be noticed that the min_nbrs parameter had little effect on the quality of the generated recommendations, since high MAP values were obtained for practically its entire range of values. By contrast, the nnbrs parameter affected the quality of recommendations to the greatest extent, especially when it took values in the range [15, 35].
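The structure of such a tuning loop can be sketched as follows. This is a simplified stand-in: plain random search replaces Optuna's TPE sampler, and a synthetic objective replaces the real validation-set MAP (the parameter names nnbrs and min_nbrs are borrowed from the Item kNN algorithm; the objective's shape is purely illustrative):

```python
import random

# Simplified stand-in for the tuning loop: Optuna's TPE sampler proposes
# hyperparameter values based on the history of previous trials; here plain
# random search illustrates the same objective/trial structure.
def validation_map(nnbrs, min_nbrs):
    # Hypothetical objective peaking around nnbrs = 25 (illustration only);
    # in the real setup this would be the MAP on the validation set.
    return 1.0 - abs(nnbrs - 25) / 50 - 0.001 * min_nbrs

random.seed(0)
best_score, best_params = float("-inf"), None
for trial in range(100):                      # trial limit, as in the paper
    params = {
        "nnbrs": random.randint(5, 50),
        "min_nbrs": random.randint(1, 10),
    }
    score = validation_map(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```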
Similar results of the parameter tuning process were obtained for the User kNN algorithm, as can be seen in Figure A2. For this algorithm, the min_nbrs parameter likewise did not significantly affect the quality of the results, while for the nnbrs parameter, the quality of recommendations was greatest when it took values in the range [35, 50].
For the ImplicitMF algorithm, four parameters were tuned: features, method, reg and weight. The results of this process are shown in Figure A3. For the features parameter, the quality of the recommendations was highest when it took values in the range [20, 25]. The method for which the algorithm achieved the best results was the cg method. For the reg parameter, the quality of recommendations was greatest when its value was close to 0.8. In addition, large values of the weight parameter degraded the quality of the generated recommendations: as the value of this parameter increased, the quality decreased, and the optimal value turned out to be 1.
The last algorithm to be tuned was the BPR algorithm. Analyzing Figure A4, it can be seen that the quality of the generated recommendations was highest when the features parameter took a value close to 45. In addition, the neg_count parameter should take values above 10, and as the value of the reg parameter increased, the quality of the generated recommendations significantly decreased, with the best value turning out to be 0.

Evaluation Metrics
To evaluate the aggregation methods, dedicated measures are used that determine the quality of the created ranking. These measures compare the generated recommendations with the items in a given user's test set. The most basic measures that can be used for this purpose are precision and recall. Precision is the fraction of recommended items that turn out to be relevant, while recall is the fraction of all relevant items that were recommended. These measures are calculated according to the following formulas:

P@k = |Rel(u_i) ∩ τ_i^r@k| / k,    Recall@k = |Rel(u_i) ∩ τ_i^r@k| / |Rel(u_i)|,

where Rel(u_i) is the set of relevant items for user u_i, and τ_i^r@k denotes the first k items of the recommended ranking τ_i^r.
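Both measures can be sketched directly from their definitions; the ranking and test set below are hypothetical:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = ranking[:k]
    return len(set(top_k) & relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    top_k = ranking[:k]
    return len(set(top_k) & relevant) / len(relevant)

ranking = ["a", "b", "c", "d", "e"]   # recommended list for one user
relevant = {"a", "c", "f"}            # the user's test-set items
print(precision_at_k(ranking, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(ranking, relevant, 5))     # 2/3 ≈ 0.667
```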
The fundamental disadvantage of simple precision is that it does not take into account the positions at which the relevant items are located. For this reason, the AP (average precision) measure is used to assess the quality of recommendations; it averages the precision values calculated at each relevant position in the recommended ranking, according to the following formula:

AP@k(τ_i) = (1 / |Rel(u_i)|) · Σ_{z=1}^{k} P@z · rel_{u_i}(x_z),

where rel_{u_i}(x_z) determines the relevance of the item x_z at position z to user u_i. The advantage of this measure is that it penalizes incorrect ordering of items in the ranking. The average precision described above is usually used when evaluating recommendations in the context of a single user. However, we often want a single number as the result of our experiments. Therefore, the mean average precision was suggested, expressed by the following formula:

MAP@k = (1 / |U|) · Σ_{i=1}^{|U|} AP@k(τ_i).
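A minimal sketch of this AP variant (assuming the normalization by |Rel(u_i)| used above; some formulations normalize by min(|Rel(u_i)|, k) instead) and of MAP as the per-user average:

```python
def average_precision_at_k(ranking, relevant, k):
    """AP@k: precision at each relevant position, averaged over the
    user's relevant items (one common variant of the measure)."""
    hits, precision_sum = 0, 0.0
    for z, item in enumerate(ranking[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / z      # P@z at a relevant position
    return precision_sum / len(relevant)

def mean_average_precision(rankings, relevants, k):
    """MAP@k: AP@k averaged over all users."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)

ap = average_precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "f"}, 5)
print(round(ap, 4))  # hits at positions 1 and 3 -> (1/1 + 2/3) / 3 ≈ 0.5556
```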
The AP and MAP measures are typically used with binary relevance, but in a situation where there are different levels of relevance in the system and we have information on how relevant an item is (e.g., on a scale of 1 to 5), it makes sense to use the normalized discounted cumulative gain measure. As with the MAP measure, the purpose of this measure is to reward items that are placed high (closer to the first position) on the recommended list, as expressed by the following formula:

DCG@k(τ_i^r) = Σ_{z=1}^{k} rel_{u_i}(x_z) / log_2(z + 1).

The DCG measure cannot be compared between users, since each user has a different number of relevant items. For this reason, a normalization is performed that uses the ideal discounted cumulative gain IDCG, which determines the maximum value of DCG attainable for the ranking τ_i^r. The NDCG measure is then obtained as:

NDCG@k(τ_i^r) = DCG@k(τ_i^r) / IDCG@k(τ_i^r).
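A minimal sketch of the NDCG computation, assuming the linear-gain variant of DCG given above (an exponential-gain variant, 2^rel − 1, is also common):

```python
import math

def dcg_at_k(relevances):
    """Discounted cumulative gain of a list of graded relevance values."""
    return sum(rel / math.log2(z + 1)
               for z, rel in enumerate(relevances, start=1))

def ndcg_at_k(ranking_rels, k):
    """NDCG@k: DCG of the recommended order divided by the DCG of the
    ideal (descending-relevance) order of the same items."""
    dcg = dcg_at_k(ranking_rels[:k])
    idcg = dcg_at_k(sorted(ranking_rels, reverse=True)[:k])
    return dcg / idcg if idcg > 0 else 0.0

# Graded relevances (e.g., 1-5 star ratings) in recommended order.
print(round(ndcg_at_k([3, 2, 5, 0, 1], 5), 4))
```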

Evaluation Protocol
For the experiments, the recommendation algorithms presented in Table 1 were used. Each of the algorithms A = {a_1, a_2, ..., a_n} generates a ranking τ, and each user u_i is represented by a collection of rankings T = {τ_1, τ_2, ..., τ_n}. The recommendation algorithms generate recommendations using the parameters found during the tuning process described in Section 4.2.
To carry out the evaluation process, the dataset had to be properly prepared. First, user ratings were sorted by timestamp. This approach is justified [16] (p. 46) because our task is to predict the future choices of users, based on their previous activity.
Then, for each user, the items they rated were divided into three sets: training (60%), validation (20%), and test (20%). The training and validation sets were used in the process of tuning the parameters of the recommendation algorithms and aggregation methods. However, it should be noted that during the final evaluation, the training set is combined with the validation set, so the final division is as follows: training set (80%) and test set (20%).
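The chronological per-user split can be sketched as follows (a minimal illustration on synthetic ratings; for the final evaluation, the training and validation parts are simply concatenated):

```python
def temporal_split(ratings, train=0.6, valid=0.2):
    """Split one user's ratings chronologically into train/valid/test.
    `ratings` is a list of (timestamp, item, rating) tuples."""
    ordered = sorted(ratings)                 # oldest interactions first
    n = len(ordered)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_valid],
            ordered[n_train + n_valid:])

ratings = [(t, f"i{t}", 5) for t in range(10)]   # 10 hypothetical ratings
train, valid, test = temporal_split(ratings)
print(len(train), len(valid), len(test))  # 6 2 2
```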
Each recommendation algorithm generated recommendations in the form of a ranking of 10 items. The following measures were used for evaluation: NDCG@10, MAP@10, P@1, P@10, and Recall@10. In addition, to demonstrate the statistical significance of the presented results, a Fisher's randomization test was performed, and a dedicated symbol is used to indicate that a particular aggregation method obtained statistically significant results (with 95% certainty) compared to all recommendation algorithms used in the experiments. The choice of this statistical test is consistent with the suggestions for the evaluation of information filtering systems found in the literature [65].
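The paired Fisher randomization test can be sketched as follows: it randomly flips the sign of each user's score difference and counts how often the permuted mean difference is at least as extreme as the observed one. The per-user scores below are hypothetical:

```python
import random

def randomization_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Paired two-sided Fisher randomization test: randomly swap each
    user's pair of scores and count permutations whose mean difference
    is at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_permutations):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical per-user NDCG@10 scores: an aggregation vs. a base algorithm.
agg = [0.61, 0.55, 0.70, 0.66, 0.58, 0.63, 0.69, 0.60]
base = [0.52, 0.50, 0.58, 0.60, 0.49, 0.55, 0.61, 0.51]
p = randomization_test(agg, base)
print(p < 0.05)  # significant at the 95% level for this toy data
```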

Results
This chapter presents the results of the experiments carried out using the RecRankAgg software. Since the research was carried out on two datasets, the chapter is divided into two subsections, corresponding to each dataset.

Results on MovieLens 100k
Analyzing the results presented in Table 4, it can be seen that the non-personalized recommendation algorithm, Most Popular, generated recommendations that were clearly inferior across all the measures used to assess recommendation quality. This result is not surprising, as non-personalized recommendation algorithms generally perform worse than personalized ones. The personalized algorithm that achieved the best results was the ImplicitMF algorithm. However, it should be noted that the differences in the quality of the generated recommendations between the personalized recommendation algorithms used in the experiments are relatively small.
Analyzing the effectiveness of the unsupervised aggregation methods, it can be noted that statistically significant results with respect to the NDCG@10 measure were achieved by the following methods: CombMNZ, Bordafuse, and LogISR. On the other hand, the methods of this type that formed the lowest-quality aggregations were: CombMIN, CombMED, and CombANZ.
Supervised aggregation methods mostly achieved statistically significant results for the NDCG@10 measure, while for the P@10 and MAP@10 measures, statistically significant results were achieved by Slidefuse, Bayesfuse, and Posfuse. The methods that achieved the worst results were Weighted Sum and Weighted Borda. It should also be noted that the supervised LognISR algorithm achieved a result identical to that of the unsupervised LogISR algorithm.
Analyzing the results of the study, it is worth noting the results obtained by the various methods for the P@1 measure. This measure determines the precision taking into account only the item in the first position of the ranking. Some of the aggregation methods (e.g., LogISR, ISR, LognISR) perform noticeably better at correctly choosing the item placed at this position.

Results on MovieLens 1M
Analyzing the results of the experiments presented in Table 5, which were carried out on the MovieLens 1M dataset, it can be seen that, as with the MovieLens 100k dataset, the non-personalized Most Popular recommendation algorithm generated recommendations of significantly lower quality than other recommendation algorithms. The personalized algorithm that achieved the best results was the Item kNN algorithm. The differences in the quality of the generated recommendations between the neighborhood-based algorithms (User kNN and Item kNN) were insignificant. The same situation occurred for algorithms based on matrix factorization (ImplicitMF and BPR).
Analyzing the effectiveness of the unsupervised aggregation methods, it can be noted that the methods that reached statistical significance with respect to the NDCG@10 measure were: LogISR, Bordafuse, CombSUM, CombMNZ and ISR. For this dataset, the aggregation methods that generated the lowest-quality aggregations were: CombMIN, CombMED, CombANZ and Condorcet. It can therefore be seen that, on both datasets, virtually the same methods created low-quality aggregations.
Almost all of the supervised methods achieved statistically significant results for the NDCG@10 and MAP@10 measures. For this dataset, it is also noteworthy that some of the unsupervised methods achieved results similar to the supervised methods.

Conclusions
This article presents the results of our research, the aim of which was to test the effectiveness of aggregation methods in recommendation systems. Five recommendation algorithms and 20 aggregation methods (10 supervised and 10 unsupervised) were used to conduct it. The process of parameter tuning was also discussed, and the RecRankAgg experimental environment was provided for easy reproduction of the performed experiments. In addition, the publicly available MovieLens 100k and MovieLens 1M datasets were used in the study.
The results of the experiments were confirmed by statistical tests and clearly indicate that aggregation methods can be successfully used in the context of recommender systems. However, it should be noted that their effectiveness varies. In general, better results can be obtained using supervised algorithms, but, interestingly, some unsupervised techniques obtained results similar to supervised ones. This is a very interesting observation, and we intend to conduct a more detailed analysis of such cases in the future. In addition, it is important to keep in mind that the parameters of the recommendation algorithms should be properly tuned before creating aggregations since, as presented in Section 4.2, the influence of individual parameters varies greatly.
To help researchers choose which aggregation algorithms to consider when reporting experimental results, the following recommendations have been made:
• Based on the analysis in Section 5.
Although a direct comparison of the results obtained in this article with those presented in [12] is not possible (because a different number of recommendation algorithms was used), some interesting similarities can be noticed. For example, when analyzing the results of the experiments on the MovieLens 1M dataset (presented in the online appendix of [12]), it can be seen that some supervised algorithms obtained results similar to unsupervised ones. This observation coincides with the results presented in our paper. The authors of [12] noted that supervised algorithms can generate aggregations of higher quality than unsupervised ones when they have access to diverse rankings. However, a closer analysis of such cases seems warranted.
When comparing the results of experiments, it should also be noted that the problem of rank aggregation is quite complex, since the aggregation is performed based on previously generated rankings. Although a dedicated research environment, RecRankAgg, has been prepared for this paper, which significantly facilitates the reproduction of the performed experiments, it would also be worthwhile to create ready-to-use datasets with pre-generated rankings in the future. This would give other researchers an easy way to conduct the experiments and reproduce their results.
In the future, we intend to use more diverse datasets and include more recommendation algorithms in our research. Another interesting direction for future work is to see which aggregation algorithms perform best with low-quality rankings. This can be done by including in the aggregation process algorithms that generate recommendations of very low quality (e.g., random recommendations). A further interesting direction of research is to see how different variants of the normalization process affect the quality of the created aggregation. In addition, in the research conducted, the recommendation algorithms generated rankings that consisted of only 10 items. It would also be reasonable to check how the number of items recommended by different recommendation algorithms affects the quality of the created aggregation.
Another interesting direction for future research is to consider the problem of rank aggregation from the perspective of consensus theory. For example, in the literature, some papers propose a dedicated measure for calculating the consensus between rankings [66]. Attention is also paid to the problem of the so-called "fair consensus" [67], in which it is considered whether aggregation can introduce disadvantageous bias to particular groups. Dedicated algorithms have been proposed to solve this problem, and their effectiveness has been tested on real-world datasets [68].
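One standard building block for such consensus measures is the Kendall tau distance, which counts pairwise ordering disagreements between two rankings. The sketch below is our own illustrative implementation; it restricts the comparison to items shared by both rankings, which is an assumption on our part about how partial top-k lists would be handled.

```python
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Count item pairs that the two rankings place in opposite relative order."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    shared = [item for item in r1 if item in pos2]  # compare only shared items
    disagreements = 0
    for a, b in combinations(shared, 2):
        # A negative product means the pair (a, b) is ordered differently.
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
            disagreements += 1
    return disagreements
```

A distance of 0 means the rankings agree on every shared pair, while the maximum value (one disagreement per pair) corresponds to completely reversed orderings, so the measure can be normalized to express the degree of consensus among input rankings.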

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
u: Generic user
u_i: Specific user
u_A: Active user in the system for which recommendations are generated
U: The set of all users
i: Generic item
i_j: Specific item
I: Set of all items
a_h: Specific recommendation algorithm
A: Set of n recommendation algorithms, A = {a_1, a_2, . . . , a_n}
τ: Generic ranking