A Forward-Looking Approach to Compare Ranking Methods for Sports

Abstract: In this paper, we provide a simple forward-looking approach to compare rating methods with respect to their stability over time. Given a rating vector of the entities involved in the comparison and the ranking induced by the rating, the stability of the methods is measured by the change in the rating vector and in the ranks of the entities over time from a forward-looking perspective. We investigate various linear algebraic rating methods and use the Euclidean distance and the Kendall tau rank correlation to measure their stability in rating and ranking, respectively. The investigations are based on both rolling and expanding window approaches. We apply the methodology to sports as a widely known ranking and rating environment. The results suggest that the PageRank and Massey rating methods provide better rating and ranking stability than simple methods, such as winning percentage, and more advanced ones, such as Colley's least squares and Keener's eigenvector-based method. Finally, a simple way to examine the potential predictive power of the rating methods is also provided.

Author Contributions: Conceptualization, P.J.O., A.L. and M.K.; methodology, A.L.; software, P.J.O.; validation, P.J.O.; formal analysis, P.J.O.; investigation, A.L. and M.K.; resources, M.K.; data curation, P.J.O.; writing—original draft preparation, P.J.O. and A.L.; writing—review and editing, M.K.; visualization, P.J.O.; supervision,


Introduction
Rating items is a fundamental task that aims at providing a ranking and making decisions according to it. For instance, in sports, the ranking of players or teams is provided by some scoring system, such as 'three points for a win and one for a draw' in soccer, or by more complex systems such as Elo in chess or the ATP ranking in men's tennis. For a good book on ratings in sports, see, e.g., [1].
Many different rating methods have been developed, and all of them are based on some assumptions or formal axioms that have to be satisfied by the rating; see, e.g., [2][3][4][5]. In the case of sports, rating methods are also considered key elements of making single-game outcome predictions; see, e.g., [6].
Although the literature on rating and rating-based predictions in sports is vast, only a few papers can be found that address the problem of evaluating and comparing the stability and robustness of rating methods over a season in a round-robin-like system, such as soccer leagues or US major sports. For some related papers, see, e.g., [5,7,8]. Our study, however, is different from the above ones, as outlined in the following paragraph.
The paper [5] focuses on general properties of sports rankings, including the Colley, win-loss, Elo, and Markov methods. The authors evaluate the rankings produced by these methods in relation to properties such as opponent strength, incentive to win, and sequence of matches. In our case, we propose a new comparison method based on a forward-looking approach to evaluate the ranking and rating stability of selected common ranking methods. In [7], the authors empirically evaluate the predictive power of eight sports ranking methods. Although we evaluated similar ranking methods, such as PageRank, Winning Percentage, Rating Percentage Index, and Keener, our comparison approach differs mainly in the stability measures based on a forward-looking approach. Our investigations may be considered a meta-analysis for predictive power studies: we hypothesize that there is a relation between predictive power and stability/robustness, and we consider our study an initial step in this direction. Finally, in [8], the authors focus on the sensitivity of the rating vectors of three linear-algebra-based ranking methods: the Colley, Massey, and Markov methods. They reverse-engineer a simple input ranking vector, use it to build a perfect season, and determine the output rating vectors produced by the three methods to measure the sensitivity. This is also a different technique from our approach.
The stability problem has also been addressed in the literature of network science, especially in the case of centrality measures; see, e.g., [9][10][11]. Since many rating methods can be interpreted as network centrality measures, investigating the stability problem for ratings in the sport domain is a convenient next step in this direction.
In this study, we propose a simple forward-looking approach to compare rating and ranking methods with respect to their robustness and stability over time. Informally, a rating (or ranking) method is considered stable over time if the differences between the rating (or ranking) vectors obtained for consecutive time periods remain small, measured by suitable distance functions. Our approach is forward-looking in the sense that stability is measured from a future perspective: if a rating 'at present' is close to the rating obtained at some future time point, this indicates stability. We evaluate and compare ratings and rankings by dynamically modifying the dataset used to calculate the ratings, using rolling and expanding window simulations.
The rest of this paper is organized as follows: In Section 2, we formally discuss several commonly used rating and ranking methods that we use in our simulations. In Section 3, we describe the evaluation framework and comparison methods. In Section 4, we discuss the simulation results on some European football league datasets. Finally, we conclude and address some future research directions in Section 5.

Rating and Ranking Methods
In this section, we give a short description of the ranking methods we use. For a more detailed introduction to ranking methods, refer to [12,13].
Let V = {1, . . . , n} be the set of n teams to be rated, and let R be the number of rounds in a competition among the teams in V. After round r (r = 1, . . . , R), a rating function φ^r : V → R assigns a score to each team, which we may call its quantitative 'strength'. A ranking σ^r : V → V is an ordering of the teams, obtained from a rating on V by a proper sorting. For rating the teams, we consider only the final scores of the games played.
We define the n × n matrices W and D as W_ij = #{i won against j} and D_ij = #{draws between i and j}.
The score matrix S ∈ R^{n×n} is defined as S_ij = #{points i scored against j}.
To avoid fully zero rows in S, we set S_ij = S_ji = 1/2 if the outcome of the game is 0:0. Using the matrix W, the elements of the vectors w = W1, l = W^T 1, d = D1, and t = (W + W^T + D)1 are the number of wins, losses, draws, and total number of games played by team i (i = 1, . . . , n), respectively, where 1 is the n-element vector with all entries equal to one. Since each game is either a win, a loss, or a draw, t = w + l + d. We define T = diag(t_i), the diagonal matrix with entries T_ii = t_i and T_ij = 0 if i ≠ j (i, j = 1, . . . , n). Similarly, we may define the vectors s = S1 and u = S^T 1 as the total number of points scored by team i against its opponents and by the opponents against team i, respectively.
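As a quick illustration, the quantities above can be computed directly from the matrices W and D. The toy results below (three teams, hypothetical outcomes) are ours, not taken from the paper's dataset:

```python
import numpy as np

# Hypothetical toy results for n = 3 teams (not from the paper's dataset):
# team 0 beat team 1 twice, team 1 beat team 2 once, teams 0 and 2 drew once.
n = 3
W = np.array([[0, 2, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)   # W[i, j] = # of wins of i over j
D = np.array([[0, 0, 1],
              [0, 0, 0],
              [1, 0, 0]], dtype=float)   # D[i, j] = # of draws between i and j

one = np.ones(n)
w = W @ one               # wins per team
l = W.T @ one             # losses per team
d = D @ one               # draws per team
t = (W + W.T + D) @ one   # total games per team
T = np.diag(t)

# Each game is a win, a loss, or a draw, so t = w + l + d:
assert np.allclose(t, w + l + d)
```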

Winning Percentage (WP)
The winning percentage of team i after round r is simply defined as φ^r_WP(i) = (w_i + κd_i)/t_i, where κ is a parameter between 0 and 1 that can be interpreted as the 'value' of a draw. For example, κ = 1/3 reflects the view that a draw is worth one third of a win. The vector of winning percentages after round r can be computed as φ^r_WP = T^{-1}(w + κd). Considering the score matrix S, a similar quantity can be calculated as φ^r_WP(S) = T^{-1}s. Observe that this method does not take into consideration the strength of the opponent teams; only the outcomes of the games count.
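The vector formula φ^r_WP = T^{-1}(w + κd) can be sketched in a few lines; the toy results are hypothetical, not from the paper's dataset:

```python
import numpy as np

def winning_percentage(W, D, kappa=1/3):
    """phi_WP = T^{-1}(w + kappa * d), where kappa is the 'value' of a draw."""
    one = np.ones(W.shape[0])
    w, d = W @ one, D @ one
    t = (W + W.T + D) @ one
    return (w + kappa * d) / t

# Hypothetical toy results: team 0 beat team 1 twice, team 1 beat team 2
# once, and teams 0 and 2 drew once.
W = np.array([[0, 2, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
D = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)
wp = winning_percentage(W, D)   # team 0: (2 + 1/3)/3 = 7/9
```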

Rating Percentage Index (RPI)
The Rating Percentage Index takes into account the WP of the team's opponents and the WP of their opponents' opponents [14]. The average winning percentage of team i's opponents after round r is (1/t_i) Σ_j (W + W^T + D)_ij φ^r_WP(j), where the average is taken over the set of the team's previous opponents after round r. The vector of the average opponents' winning percentages is thus T^{-1}(W + W^T + D)φ^r_WP. The winning percentages of the opponents' opponents can be calculated as T^{-1}(W + W^T + D)^2 φ^r_WP. After round r, the RPI vector is calculated as the weighted average of these three quantities (with the commonly used weights 1/4, 1/2, and 1/4): φ^r_RPI = (1/4)φ^r_WP + (1/2)T^{-1}(W + W^T + D)φ^r_WP + (1/4)T^{-1}(W + W^T + D)^2 φ^r_WP, and similarly, given the score matrix S, with φ^r_WP(S) in place of φ^r_WP.
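A sketch of the RPI computation, assuming the standard 25/50/25 weighting and κ = 1/2; both coefficients are our assumptions for a runnable example, and the toy data are ours as well:

```python
import numpy as np

def rpi(W, D, kappa=0.5):
    """RPI sketch: weighted average of WP, opponents' WP, and opponents'
    opponents' WP. The 25/50/25 weights below are the commonly used ones
    and are an assumption here, not taken from the paper."""
    n = W.shape[0]
    one = np.ones(n)
    G = W + W.T + D                      # games played between pairs of teams
    T_inv = np.diag(1.0 / (G @ one))
    wp = T_inv @ (W @ one + kappa * (D @ one))
    owp = T_inv @ G @ wp                 # average WP of previous opponents
    oowp = T_inv @ G @ owp               # ... and of the opponents' opponents
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

# Hypothetical toy results (same convention as the other examples):
W = np.array([[0, 2, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
D = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)
r = rpi(W, D)
```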

Massey's Least Squares Method (M)
The only statistics used in Massey's least squares method [15] are the numbers of wins and losses for each team. The rating φ^r_M of the teams after round r is obtained as the solution of the linear system Mφ^r_M = w − l, where the Massey matrix M contains the total number of games played by each team in the diagonal (M_ii = t_i), while M_ij is −1 times the number of games played between teams i and j, i ≠ j. The method naturally incorporates draws, since a draw between two teams increases the number of games counted in M_ij and M_ji by one, while the right-hand side w − l remains unchanged. Since rank(M) < n, the linear system does not have a unique solution. To handle this problem, one possible solution is to replace any row in M with 1 and the corresponding entry of w − l with zero.
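A minimal sketch of the Massey solve, including the row-replacement fix described above; the toy data are hypothetical:

```python
import numpy as np

def massey_rating(W, D):
    """Solve M phi = w - l, replacing the last row by the all-ones
    constraint to handle rank(M) < n."""
    n = W.shape[0]
    G = W + W.T + D                      # games played between pairs
    M = np.diag(G @ np.ones(n)) - G      # Massey matrix (a graph Laplacian)
    b = (W - W.T) @ np.ones(n)           # wins minus losses
    M[-1, :] = 1.0                       # replace one row with ones ...
    b[-1] = 0.0                          # ... so the ratings sum to zero
    return np.linalg.solve(M, b)

# Hypothetical toy results: team 0 unbeaten, team 2 winless.
W = np.array([[0, 2, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
D = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)
phi = massey_rating(W, D)
assert abs(phi.sum()) < 1e-10        # sums to zero by construction
assert phi[0] > phi[1] > phi[2]
```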

Colley's Least Squares Method (C)
The Colley method is also a modification of the least squares method, utilizing an observation called Laplace's rule of succession (see [16], p. 148), which states that if one has observed k successes out of m attempts, then (k + 1)/(m + 2) is a better estimate of the probability that the next event is a success than k/m. The rating vector φ^r_C of the teams is the solution of the linear system Cφ^r_C = b, where C = M + 2I (here, I is the identity matrix) and b = 1 + (w − l)/2. It can be proved that this linear system has a unique solution.
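The Colley system differs from the Massey sketch only in the matrix shift and the right-hand side; a minimal sketch with hypothetical toy data:

```python
import numpy as np

def colley_rating(W, D):
    """Solve C phi = b with C = M + 2I and b = 1 + (w - l)/2."""
    n = W.shape[0]
    G = W + W.T + D
    M = np.diag(G @ np.ones(n)) - G
    C = M + 2.0 * np.eye(n)              # positive definite: unique solution
    b = np.ones(n) + 0.5 * (W - W.T) @ np.ones(n)
    return np.linalg.solve(C, b)

# Hypothetical toy results (same convention as the other examples):
W = np.array([[0, 2, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
D = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)
phi = colley_rating(W, D)
assert abs(phi.mean() - 0.5) < 1e-10   # Colley ratings always average to 1/2
```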

Keener Method (K)
Keener's method [17] is a so-called spectral rating method which uses the Perron-Frobenius eigenvector for the rating, and (after round r) it is given by the solution of the eigenvalue equation T^{-1}(W + κD)φ^r_K = λφ^r_K, where λ is the dominant eigenvalue of the matrix T^{-1}(W + κD); such an eigenvalue exists for a matrix with non-negative entries, and any other eigenvalue is smaller in absolute value. The corresponding eigenvector, called the Perron-Frobenius eigenvector, has non-negative entries and provides the rating of the teams. Originally, the method was defined for the case in which we consider the score matrix S. The Keener matrix, also based on Laplace's rule of succession, is defined as K_ij = h((S_ij + 1)/(S_ij + S_ji + 2)), where h is a skewing function helping to reduce the difference between the upper and lower ends of the rating. We use the original function defined by Keener, namely h(x) = 1/2 + sgn(x − 1/2)√|2x − 1|/2. The Keener rating vector φ^r_K(S) of the teams is given by the solution of the equation Kφ^r_K(S) = λφ^r_K(S).
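A power-iteration sketch for the dominant eigenvector of the Keener matrix. Zeroing the diagonal and the hypothetical score matrix are our simplifications for a runnable example:

```python
import numpy as np

def keener_rating(S, n_iter=200):
    """Power iteration for the Perron eigenvector of the Keener matrix.
    Zeroing the diagonal is a simplification in this sketch."""
    n = S.shape[0]
    a = (S + 1.0) / (S + S.T + 2.0)      # Laplace's rule of succession
    # Keener's skewing function h(x) = 1/2 + sgn(x - 1/2) sqrt(|2x - 1|)/2:
    K = 0.5 + np.sign(a - 0.5) * np.sqrt(np.abs(2.0 * a - 1.0)) / 2.0
    np.fill_diagonal(K, 0.0)
    phi = np.ones(n) / n
    for _ in range(n_iter):              # converges for an irreducible K
        phi = K @ phi
        phi /= phi.sum()
    return phi

# Hypothetical score matrix: team 0 outscores everyone, team 2 the weakest.
S = np.array([[0, 5, 3], [1, 0, 2], [0, 1, 0]], dtype=float)
phi = keener_rating(S)
```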

PageRank Method (PR)
The PageRank method [18] was originally designed to rank web pages based on their position in the WWW network. The idea behind it comes from the basic properties of Markov chains (see, e.g., [12], Chapter 4). In the context of sports, the rating of the teams is calculated iteratively using the recursion formula PR(i) = λ/n + (1 − λ) Σ_{j∈N^+(i)} PR(j)/w_j, where N^+(i) is the set of teams defeated by team i at least once, w_j is the total number of wins of team j, and λ ∈ [0, 1] is a parameter (usually 0.1 or 0.2) that guarantees convergence. To see the relationship between the PageRank formula and the theory of Markov chains, we may write the above equation in vector form as PR = (λ/n)1 + (1 − λ)WD^{-1}PR, or equivalently (I − (1 − λ)WD^{-1})PR = (λ/n)1, where the PageRank vector PR contains the PageRank values of the teams, D is the diagonal matrix with D_jj = w_j, and I is the n × n identity matrix. Assuming that 1^T PR = 1, it follows that PR = MPR, with M = (λ/n)11^T + (1 − λ)WD^{-1}. This shows that PR is an eigenvector of the matrix M for eigenvalue one, which is the largest eigenvalue of M as a consequence of the Perron-Frobenius theorem for stochastic matrices. The rating vector φ^r_PR of the teams after round r can be calculated using, for instance, the power iteration method.
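A power-iteration sketch of a PageRank-style rating on the results graph. The normalization convention (each team distributes its score to the teams that defeated it, in proportion to its losses) and the handling of unbeaten teams are our assumptions for a runnable example; the paper's exact normalization may differ:

```python
import numpy as np

def pagerank_rating(W, lam=0.15, n_iter=200):
    """PageRank-style rating on the results graph (edges loser -> winner)."""
    n = W.shape[0]
    losses = W.sum(axis=0)                  # column j sums = losses of team j
    B = np.zeros((n, n))
    for j in range(n):
        if losses[j] > 0:
            B[:, j] = W[:, j] / losses[j]   # j passes its score to its victors
        else:
            B[:, j] = 1.0 / n               # unbeaten team: spread uniformly
    pr = np.ones(n) / n
    for _ in range(n_iter):                 # fixed-point (power) iteration
        pr = lam / n + (1.0 - lam) * B @ pr
    return pr

# Hypothetical results: team 0 is unbeaten, teams 1 and 2 traded wins.
W = np.array([[0, 2, 1], [0, 0, 1], [0, 1, 0]], dtype=float)
pr = pagerank_rating(W)
assert abs(pr.sum() - 1.0) < 1e-8   # B is column-stochastic, so pr sums to 1
```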

Graph Representation of the Methods
We shall emphasize that all the above-defined methods have a graph-theoretical interpretation. Using the game results dataset, one can define a directed multigraph where nodes represent players/teams, while edges between them represent the outcomes of the games they played. Each edge is directed from the losing team to the winning team. If ties are also considered, they can be represented by two directed links with opposite directions and half (or some fractional) weight. In this case, the matrix W is the adjacency matrix of the directed multigraph, and w and l contain the in- and out-degrees of the nodes, respectively. From a network science perspective, Massey's matrix M is the graph Laplacian if the results matrix is treated as the matrix of a symmetric undirected graph. The Massey rating vector φ_M is then equivalent to the potential vector over a resistor network defined by W with supply vector w − l [19]. The PageRank method is a simple modification of the classic PageRank algorithm, performed on the results graph.

Evaluation and Comparison of Rating Methods
In this section, we present the applied simulation approaches and the definitions of the stability of ratings and rankings as well as the rating error. To deal with the dynamic nature of sport competitions, we perform rolling window (RW) and expanding window (EW) simulations, described as follows.

Rolling Window Approach
Let W^t (or S^t) be the results matrix generated just after t games (here t = 50, 60, 70, . . .), and let φ^t_RW be the corresponding rating vector. In the rolling window approach, we generate the results matrix W^{t+∆t}_RW from a window of a fixed number of games (the window length) shifted forward by ∆t games, and calculate the rating φ^{t+∆t}_RW for the new matrix using the same rating method. For example, with a window length of 50 and ∆t = 10, games 1 to 50, 11 to 60, 21 to 70, etc., are considered to create the results matrices and ratings.

Expanding Window Approach
In the expanding window case, let W^{(T,∆t)}_EW (or S^{(T,∆t)}_EW) be the results matrix generated by an incremental number of games, starting from the first T games and growing by the expansion factor ∆t. For instance, starting from T = 50 with expansion factor ∆t = 10, W^{(50,10)}_EW is the results matrix generated considering the first 50, 60, 70, etc., games from the beginning of the competition. The team rating after game t is denoted by φ^t_EW.
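The two window schemes can be sketched as index generators over the chronologically ordered game list; the window length of 50 and ∆t = 10 mirror the values used in our simulations, and the helper names are ours:

```python
def rolling_windows(n_games, length=50, dt=10):
    """(start, end] index pairs for rolling windows: games 1-50, 11-60, ..."""
    return [(s, s + length) for s in range(0, n_games - length + 1, dt)]

def expanding_windows(n_games, start=50, dt=10):
    """(0, end] index pairs for expanding windows: games 1-50, 1-60, ..."""
    return [(0, e) for e in range(start, n_games + 1, dt)]

assert rolling_windows(70) == [(0, 50), (10, 60), (20, 70)]
assert expanding_windows(70) == [(0, 50), (0, 60), (0, 70)]
```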

Rating Stability
To measure the stability of the considered methods, we compute the Euclidean distance between consecutive rating vectors obtained by either the rolling or the expanding window approach with specified ∆t values [20]. Formally, we calculate d_2^{RW}(t) = ||φ^t_RW − φ^{t+∆t}_RW||_2, where || · ||_2 denotes the Euclidean norm. If we average d_2^{RW}(t) over all t = 50, 60, . . ., we obtain a single value representing the average stability of the rating method over the whole competition or up to a given round. The stability in the case of the expanding window approach is measured similarly.
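Given a precomputed list of rating vectors φ^t for t = 50, 60, . . ., the averaged stability measure can be sketched as follows (the input here is hypothetical):

```python
import numpy as np

def rating_stability(ratings):
    """Average Euclidean distance between consecutive rating vectors;
    lower values indicate a more stable rating method."""
    dists = [np.linalg.norm(a - b) for a, b in zip(ratings, ratings[1:])]
    return float(np.mean(dists))

# A constant rating sequence is perfectly stable:
flat = [np.array([0.5, 0.3, 0.2])] * 4
assert rating_stability(flat) == 0.0
```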

Ranking Stability
To measure the stability of the rankings generated by the applied rating methods, we measure rank correlations using the Kendall tau method [21]. Given two consecutive rankings, σ^t_RW = σ^1 and σ^{t+∆t}_RW = σ^2, the Kendall tau rank correlation is defined as τ^{RW}(t) = (n_c − n_d)/(n(n − 1)/2), where n_c and n_d are the numbers of concordant and discordant pairs of teams, respectively, and σ^1_i and σ^2_i denote the rank of team i in σ^t_RW and σ^{t+∆t}_RW (t = 50, 60, . . .), respectively.
Averaging τ^{RW}(t) over all t = 50, 60, . . ., we obtain a single value representing the mean stability of the ranking method over the whole competition or up to a given round. The stability in the case of the expanding window approach can be measured similarly.
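The rank correlation can be sketched directly from its definition via concordant and discordant pairs (a toy version assuming no ties; `scipy.stats.kendalltau` offers a production implementation):

```python
from itertools import combinations

def kendall_tau(rank1, rank2):
    """tau = (n_c - n_d) / (n(n-1)/2), where rank1[i] and rank2[i] are the
    ranks of team i in the two rankings (no ties assumed in this sketch)."""
    n = len(rank1)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        s = (rank1[i] - rank1[j]) * (rank2[i] - rank2[j])
        if s > 0:
            n_c += 1      # pair ordered the same way in both rankings
        elif s < 0:
            n_d += 1      # pair ordered oppositely
    return (n_c - n_d) / (n * (n - 1) / 2)

assert kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]) == 1.0    # identical rankings
assert kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]) == -1.0   # exactly reversed
```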

Rating Error
We also estimate the potential predictive power of the rating methods in a simple way. Each dataset is divided into two subsets: a training set and a test set. From the training set, a rating φ^t is calculated for t games (t = 50 fixed in the case of the rolling window approach, while t = 50, 60, . . . in the case of the expanding window approach). The test set consists of the next ∆t games (∆t = 10 in our simulations). We define the prediction error E^t_φ of a rating method φ as the proportion of games in the test set in which the lower-rated team beat the higher-rated one, i.e., E^t_φ = #{games in the test set won by the lower-rated team}/∆t. The total error is calculated as the average of the errors obtained over all train and test set samples.
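The prediction error can be sketched as follows, with hypothetical ratings and test outcomes (draws are ignored in this toy version):

```python
import numpy as np

def prediction_error(phi, test_games):
    """Fraction of test games in which the lower-rated team beat the
    higher-rated one. `test_games` is a list of (winner, loser) pairs."""
    upsets = sum(1 for winner, loser in test_games if phi[winner] < phi[loser])
    return upsets / len(test_games)

phi = np.array([0.9, 0.5, 0.1])           # hypothetical training-set ratings
test = [(0, 1), (0, 2), (2, 1), (1, 2)]   # observed (winner, loser) outcomes
assert prediction_error(phi, test) == 0.25  # one upset out of four games
```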

Results
We performed our experiments using English Premier League datasets (source: https://www.kaggle.com/datasets/saife245/english-premier-league (accessed on 15 March 2022)). The datasets contain the date of each game, the names of the teams, the home and away scores, and the total points of the teams during the competition. To generate the results matrices (graphs), we used the W matrix in the case of the PageRank, Massey, Colley, WP, and RPI methods. In the case of the Keener method, we considered the score matrix S. We performed rolling window (RW) and expanding window (EW) simulations to analyze ranking and rating stability using the Kendall tau and Euclidean measures, respectively. The results are presented via tables and plots in this section.

Comparison of Top-5 Teams Ranking by Rolling Window Approach
First, we compared the rankings and ratings of the top-5 teams using our rolling window and expanding window approaches. Here, we considered the standard deviation of the ratings of the top-5 teams over different time windows (games). Table 1 summarizes the rolling window results. In all the investigated windows (games 10-60, 20-70, and 30-80), Man. City was rated and ranked the best team among the top-5 by PageRank (sd ± 0.0522; sd ± 0.0116; sd ± 0.0125), Massey (sd ± 0.0333; sd ± 0.0409; sd ± 0.0418), and Keener (sd ± 0.0328; sd ± 0.0434; sd ± 0.0482), while the Massey and Keener methods ranked and rated Man. United as the second-best team among the top five. On the other hand, Man. City and Man. United were rated and ranked as the first and second teams, respectively, by the WP method (sd ± 0.2418; sd ± 0.2256; sd ± 0.2097) in all the windows. In general, using our rolling window approach, we can observe that PageRank, Massey, and Keener perform relatively better than the other investigated ranking methods (see Table A1 in Appendix A), as these three ranking methods recorded relatively small standard deviations. A small standard deviation over different windows implies small variation in a team's rating, hence rank-rate stability, and vice versa.

Comparison of Top-5 Teams Ranking by Expanding Window Approach
Next, we compared the rank-rate of the top-5 teams using the expanding window approach. Table 2 shows the summary of the results. According to the analysis after 60 and 70 games, Man. City and Man. United were rated and ranked the best teams among the top-5 by PageRank (sd ± 0.0165; sd ± 0.0174; sd ± 0.0176), Massey (sd ± 0.0242; sd ± 0.0226; sd ± 0.0210), and WP (sd ± 0.0971; sd ± 0.0865; sd ± 0.0737). Both WP and PageRank rated and ranked Man. City as the best team among the top-5 in all windows, while Arsenal was rated and ranked the best team in all windows by Colley (sd ± 0.0773; sd ± 0.0754; sd ± 0.0775). In general, using our expanding window approach, we can observe that PageRank and Massey perform relatively better than the other investigated ranking methods (see Table A2 in Appendix A), as the PageRank and Massey methods recorded relatively small standard deviations; PageRank was the most stable in ranking among the investigated methods. As mentioned in Section 4.1, a small standard deviation of the team ratings implies small variation in the team ranking, hence rank-rate stability.

Rating Stability
We evaluate the rating stability based on the Euclidean distance measure described in Section 3.3. In this analysis, we compute the Euclidean distance between two consecutive rating vectors obtained by the rolling and expanding window approaches, respectively, to measure their similarity or deviation. The mean stability of the rating methods is based on the average Euclidean distances d_2^{RW}(t) and d_2^{EW}(t) (t = 50, 60, . . .), i.e., the mean distance between consecutive team rating vectors. The lower the d_2(t) value, the more stable the rating method.

Evaluation by Rolling Window Approach
We measure the distance d_2^{RW}(t) between two consecutive rating vectors. According to the results in Figure 1, for the rolling window simulation, the distance values d_2^{RW}(t) tend to change over time (i.e., from window to window). Generally, PageRank and Massey recorded low average values of d_2^{RW} = 0.025 and d_2^{RW} = 0.029, respectively. On the other hand, Colley, Keener, WP, and RPI recorded higher distance values, with averages of d_2^{RW} ≥ 0.035. This implies that those methods have lower rating stability due to the high deviation (i.e., low similarity) of their rating vectors.

Evaluation by Expanding Window Approach
We also compared and evaluated the rating stability of the investigated methods using the expanding window approach. Similarly, we measure the distance d_2^{EW}(t) between two consecutive rating vectors at incremental window sizes (i.e., after 50, 60, 70, . . . games), as described in Section 3.3. The results in Figure 2 suggest that the distance values d_2^{EW}(t) for the expanding window simulation increase over time (i.e., from window to window). Again, PageRank and Massey recorded low average distance values, ranging between d_2^{EW} = 0.025 and d_2^{EW} = 0.03; a low d_2^{EW} value implies low deviation (i.e., high similarity) of the rating vectors and hence high rating stability. Colley, Keener, WP, and RPI recorded slightly higher average distance values, ranging between d_2^{EW} = 0.035 and d_2^{EW} = 0.040, indicating that those methods have lower rating stability due to the higher deviation (i.e., lower similarity) of their rating vectors.

Ranking Stability
As mentioned in Section 3.4, we compared the ranking stability of the investigated methods using the rolling window and expanding window approaches based on the Kendall tau method. Here, we consider the rank correlation coefficient τ, taking values between −1 and +1, which characterizes the degree of ranking stability (i.e., the agreement between two rank lists). Statistically, τ measures the similarity (via concordant and discordant pairs) of two rank lists. A value of τ = +1 indicates the highest possible ranking stability, i.e., the two rank lists are exactly the same; τ = −1 indicates the lowest ranking stability, i.e., the two rank lists are exactly opposite; and τ = 0 implies that one rank list is essentially a random reordering of the other.

Evaluation by Rolling Window Approach
According to the results in Figure 3, PageRank and Massey recorded the highest rank correlations, τ^{RW} ≥ 0.60 and τ^{RW} ≥ 0.80, respectively. Both Colley and Keener recorded rank correlations of τ^{RW} ≈ 0.60, while WP and RPI recorded low rank correlations, i.e., τ^{RW} ≤ 0.60. In general, PageRank, Colley, and Massey show relatively stable ranking performance compared with Keener, WP, and RPI, which tend to be unstable over time (across different windows/numbers of games). No method recorded τ^{RW} ≤ 0, which implies that all six investigated ranking methods show some degree of ranking stability under our rolling window approach.

Evaluation by Expanding Window Approach
We further compared the ranking stability of all the investigated ranking methods using the expanding window approach. According to the results in Figure 4, the PageRank, Colley, Massey, and Keener methods recorded higher rank correlation values of τ^{EW} ≥ 0.60, with PageRank recording the highest values of τ^{EW} ≥ 0.70. WP and RPI recorded relatively low rank correlation values of τ^{EW} ≤ 0.60. Overall, the analysis indicates that as we increase/expand the window size (i.e., the number of games), the ranking stability tends to increase over time.

Rating Error
As described in Section 3.5, we evaluated the predictive power of the rating methods using a very simple and intuitive approach. For the training set, we considered a fixed number of games (50) or an incremental number of games (50, 60, . . .) for the rolling window and expanding window simulations, respectively. For the test set, a fixed number of games (10 in our case), played right after the games in the training set, was used.
The rating errors are shown in Table 3. PageRank and Massey had low average rating errors of 0.2568 and 0.2819, respectively. This leads to the hypothesis that both the PageRank and Massey rankings have higher predictive power than the others. A more detailed comparison of the rating errors can be seen in Figures A1 and A2 in Appendix B.

Discussions
To gain a deeper insight into how some widely used rating systems work, we compared the rating and ranking performance of six rating methods. We applied a forward-looking approach to compare and evaluate their ranking and rating stability. In our experimental investigations, we considered the 2014 English Premier League dataset for the simulations (similar to the NFL data used in a related study [12], or the US major sports data used in [7]). Our approach provides an efficient tool to compare and evaluate the stability of rankings or ratings of teams obtained by different methods.
We used a distance-based approach to compare the rating stability utilizing the Euclidean distance measure. It takes into consideration the difference of the consecutive rating vectors. Rating methods with small deviation measures tend to have higher rating stability [22,23]. According to the results in Figures 1 and 2, PageRank generally recorded low deviation measures in both rolling window and expanding window simulation.
The results of the evaluation of ranking stability by the rolling window and expanding window approaches are presented in Section 4. Among the six methods we examined (PageRank, Colley, Massey, Keener, WP, and RPI), we observed differences in the ranking results at different time windows and window sizes using the Kendall tau rank correlation. Some rating methods, such as WP and RPI, produce rankings that are similar to each other compared with the others. In a round-robin tournament, the rank correlation coefficient changes irregularly over time at different window sizes.
We also conducted a comparison of rank-rate performance providing some new insights into the functionality of rating systems (see Tables A1 and A2, Appendix A).
When we considered an increasing time window (expanded by a constant factor), we observed that the Kendall tau rank correlation stabilized over time. This implies that the overall ranking generally becomes more stable when approaching the end of the competition.
According to the prediction error results in Table 3, for the rolling window simulation, PageRank and Massey methods recorded a low mean prediction error of 0.257 and 0.282, respectively. On the other hand, WP (0.472) and RPI (0.481) recorded higher prediction errors. Further evaluation of the prediction error based on the expanding window approach shows a similar trend. However, PageRank and Massey recorded slightly higher prediction errors in this case, being 0.283 and 0.324, respectively. In contrast, Colley, Keener, WP, and RPI recorded slightly low prediction errors compared to the rolling window case. Colley, Keener, WP, and RPI tended to predict better using the expanding window approach (see Appendix B).
We have also seen that the prediction error depends on the rating and ranking stability of the methods. Stable rating methods tend to record low prediction errors compared to less stable methods, in agreement with the findings in [24]. Generally, the findings of this study, in agreement with the related literature, suggest that PageRank is a more stable and robust rating method in the sports domain than the other five methods. PageRank, which was developed originally in the search engine domain [18], has been applied in various other domains as well as in sports. To mention some related studies, a time-dependent PageRank was used for ranking sports tournaments [25,26], and PageRank was also applied to randomized sports data to rank teams and individual players [27]. Our findings, in general, coincide with the previous ones, showing the distinguished capability and performance of PageRank in rating and ranking compared to most of the other approaches.

Conclusions
This study presents a forward-looking approach to compare and evaluate six basic rating methods under two different simulation scenarios, namely a rolling window and an expanding window approach. The rank-rate comparison indicates that the PageRank and Massey methods are consistent and robust in rating and ranking teams in both the rolling and expanding forward-looking approaches. The evaluation of ranking stability using Kendall tau correlation coefficients shows that PageRank has a high rank correlation coefficient, indicating its stability in ranking over time. Similarly, the evaluation of rating stability by the Euclidean distance measure indicates that both the PageRank and Massey methods exhibit only small changes in the distance measure in both simulation setups, hence showing high rating stability in general. The evaluation of rating error suggests that PageRank has high predictive power in both the rolling and expanding window simulations. In general, the PageRank and Massey methods performed well in both the rolling and expanding window tests. Nevertheless, further comparisons may be needed to test their rating stability as well as their robustness in other applications.

Appendix A

Table A2 shows the extended results of the comparison of the rank-rate and the standard deviation of the top-5 teams by the expanding window approach after 50, 60, 70, 80, 90, and 100 games.

Appendix B
Below is a supplementary, detailed illustration of the rating errors for the different tests. Samples were obtained from the rolling and expanding window approaches.
Figure A1. Rating error E_φ(t) at different window times for the PageRank, Colley, Massey, Keener, WP, and RPI methods by the rolling window approach. A lower E_φ(t) indicates higher prediction power and better rating performance, while a larger E_φ(t) indicates lower prediction power and hence lower rating performance.
Figure A2. Rating error E_φ(t) at different window times for the PageRank, Colley, Massey, Keener, WP, and RPI methods by the expanding window approach. A lower E_φ(t) indicates higher prediction power and better rating performance, while a larger E_φ(t) indicates lower prediction power and hence lower rating performance.