Machine Learning-Based Identification of the Strongest Predictive Variables of Winning and Losing in Belgian Professional Soccer

Abstract: This study aimed to identify the strongest predictive variables of winning and losing in the highest Belgian soccer division. A predictive machine learning model based on a broad range of variables (n = 100) was constructed, using a dataset of 576 games. To avoid multicollinearity and reduce dimensionality, a Variance Inflation Factor analysis (threshold of 5) and BorutaShap were applied, respectively. A total of 13 variables remained and were used to predict winning or losing using Extreme Gradient Boosting. TreeExplainer was applied to determine feature importance on a global and local level. The model showed an accuracy of 89.6% ± 3.1% (precision: 88.9%; recall: 90.1%; F1-score: 89.5%), correctly classifying 516 out of 576 games. Shots on target from the attacking penalty box proved to be the best predictor. Several physical indicators were amongst the best predictors, as well as contextual variables such as ELO-ratings, the added transfer value of the benched players and match location. The results show the added value of including a broad spectrum of variables when predicting and evaluating game outcomes. Similar modelling approaches can be used by clubs to identify the strongest predictive variables for their leagues, and to evaluate and improve their current quantitative analyses.


Introduction
The numerous unique chains of dynamic interactions between players during soccer games can be reduced to different performance indicators, to allow more insight into game performance [1]. A performance indicator is a selection, or combination, of action variables aiming to define some or all aspects of performance, and can be used to assess the performances of an individual, a team or the elements of a team [2]. Generally, sports performance judgments are prone to bias [3], as good outcomes are attributed to internal causes and bad outcomes to external causes [4]. Better insights into performance indicators could therefore not only aid the decision-making process of coaches and players with regard to training and game preparation, but also help other stakeholders related to the team, such as scouting and management, to correctly evaluate team and player performances [5].
Previously, several performance indicators have already been linked to performance in soccer, such as ball possession [6][7][8][9], number of passes [10,11], number of shots [6,8,10,12], number of shots on target [6,8,12], entries into the penalty box [9,13] and successfulness in duels [9]. Previous studies investigating performance indicators in soccer, however, often suffer from limitations and/or methodological problems, such as small sample sizes and univariate analyses of the observed variables [6]. If presented in isolation, a single set of indicators representing the performance of an individual or a team can give a distorted impression of a performance by ignoring other, more or less important, variables [2], which likely explains differences between previous studies [6]. It is therefore important that variables are not considered in isolation when relating performance indicators to game performance.
Technical innovations have led to an increasing availability of different performance indicators. Advanced metrics can currently be calculated in soccer using tracking data [14]. As tracking data results in millions of data points per season [15], calculating these metrics challenges the data management and analytical methods of analysts [16]. Advanced metrics based on tracking data are also often provided by professional sports data companies, but obtaining these metrics is usually not free. So although metrics based on tracking data may better capture the complex nature of soccer [14], obtaining them may at present not be feasible for many teams for practical and/or financial reasons. Furthermore, although academics have shown an increased interest in the use of tracking data, practitioners still tend to use more traditional performance indicators [17].
In contemporary soccer, there is a wealth of different performance indicators, describing technical, tactical and physical performances, as well as contextual information. In order to provide to-the-point information to the coaching staff, a selection of the most important performance indicators can be helpful. Therefore, this study aimed to identify the strongest predictive variables of winning and losing in soccer using a broad range of variables. Machine learning is used instead of inferential statistics, as machine learning is better at handling large amounts of input variables [18]. Based on previous research, it can be hypothesized that shot-related variables such as shots and shots on target are closely related to performance in soccer; however, it is also expected that other indicators will be amongst the most important predictors.

Sample
All data was collected in the highest Belgian soccer division, over the course of 3 seasons (2017-2018, 2018-2019, 2019-2020), totalling 771 games. It should be noted that, because of the COVID-19 pandemic, the last part of the 2019-2020 season was not played. Games with missing values because of measurement errors, such as missing physical data and/or technical data, and games that resulted in a draw, were excluded from the analysis. This left 576 games available for the analysis. All games were tracked using the SportVU system (Stats Perform, Chicago, IL, USA), an optical tracking system using three high-definition cameras [19]. Consent was given by STATS Perform and the Belgian Pro League for the use of the data for scientific purposes. The reliability of the data delivered by professional sports data companies has been shown to be high [20,21]. The study was conducted in accordance with the Declaration of Helsinki.

Variables
A total of 100 variables were included in the analysis (Table 1). Variables were selected based on a combination of expert knowledge and availability. All variables, with the exception of contextual game information, were derived from three sources, STATS Viewer, STATS Dynamix and STATS Edge, all offered by STATS Perform. Data regarding the teams' playing styles and Expected Goals were derived from STATS Edge. Definitions of Playing Styles and the calculation of Expected Goals can be found on the website of STATS Perform (https://www.statsperform.com, accessed on 28 June 2020). Physical game data was available from STATS Dynamix. The default speed, acceleration and deceleration thresholds for physical data were used, including the minimum effort duration of 0.5 s set for variables related to speed, acceleration and deceleration. All other variables, with the exception of contextual game information, were derived from STATS Viewer. Inside the STATS Viewer software, users can derive their own metrics, based on criteria such as destination and direction. Direction could be set to one of three categories, all relative to the opponent's goal. A ball played at an angle of −45° to +45° was defined as forward; a ball played at an angle beyond ±135° (that is, from +135° through 180° to −135°) was defined as backwards. Balls played at an angle of +45° to +135°, or −45° to −135°, were defined as sideways. Contextual variables were obtained from other sources. ELO-ratings were included as a measure of team strength, provided by the API of http://www.clubelo.com, accessed on 1 July 2020, which is freely available. This source provides ELO-ratings for each team of over 50 (inter)national leagues, including the UEFA Europa League and Champions League, and has previously been used for scientific purposes [22,23].
Each game, both national and international, results in an exchange in ELO-points (Equation (1)), where dr is the difference in ELO-rating between two clubs and R is the result (1 for win, 0.5 for draw and 0 for loss). The exchange in ELO-points is higher for a win against a stronger team compared to a victory against an equally strong or weaker team, and vice versa for losses. This equation was derived from http://www.clubelo.com, accessed on 1 July 2020.
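As a rough illustration of how such an exchange behaves, the sketch below assumes the standard Elo formulation: a logistic expected result on a 400-point scale and a weight factor k = 20. Both constants, and the function names, are assumptions made for illustration rather than values taken from Equation (1).

```python
def expected_result(dr: float) -> float:
    """Expected result (between 0 and 1) for a team rated dr points above its opponent.

    Assumes the standard Elo logistic curve on a 400-point scale.
    """
    return 1.0 / (10.0 ** (-dr / 400.0) + 1.0)


def elo_exchange(dr: float, result: float, k: float = 20.0) -> float:
    """ELO-points gained (positive) or lost (negative) by the team.

    dr     -- difference in ELO-rating (team minus opponent)
    result -- 1 for a win, 0.5 for a draw, 0 for a loss
    k      -- weight factor; 20 is an assumed value
    """
    return k * (result - expected_result(dr))


# A win against an equally strong team versus a win against a stronger team.
print(elo_exchange(0, 1.0))     # equally strong opponent
print(elo_exchange(-200, 1.0))  # opponent rated 200 points higher
```

Under these assumptions the exchange is zero-sum: the points gained by one team equal the points lost by the other.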
Form was defined as the difference between a club's current ELO-rating and its ELO-rating before the previous game. Market values, ages and nationalities were obtained from https://www.transfermarkt.co.uk, accessed on 1 July 2020. These variables have previously been used for scientific purposes [12]. Lastly, the number of days between games was used as a variable, also counting other competitive games such as cup and European games, to consider the possible effect on winning or losing of additional games for one team compared to the other.
Differences between the two teams competing during each game were calculated and used as input variable in this study, with the exception of match location. During games, there are interactions between the two competing teams [1], and these interactions will result in different performance statistics for both teams. The difference of each performance statistic between the two teams may provide insights into the difference in performance by the two teams on the pitch, and therefore be more informative to the model than the separate performance statistics of both the teams. Moreover, by not including the performance statistics of both teams into the model, the dimensionality of the model can be limited.
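A minimal sketch of this differencing step is given below. The statistic names and values are hypothetical, and the helper `difference_features` is introduced only for illustration; match location is kept as a separate indicator rather than differenced.

```python
def difference_features(team: dict, opponent: dict, home_game: bool) -> dict:
    """Build one model input row from the perspective of `team`.

    Every statistic present for both teams becomes a team-minus-opponent
    difference; match location is added as a 0/1 indicator.
    """
    features = {
        f"diff_{name}": team[name] - opponent[name]
        for name in team
        if name in opponent
    }
    features["home_game"] = int(home_game)
    return features


# Hypothetical statistics for a single game.
home_stats = {"shots_on_target_box": 5, "crosses": 12, "elo": 1650}
away_stats = {"shots_on_target_box": 2, "crosses": 18, "elo": 1580}

features = difference_features(home_stats, away_stats, home_game=True)
print(features)
```

Two statistics per team collapse into a single difference, which is how this representation roughly halves the number of performance inputs.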

Procedures
Data from all sources was loaded into Python (version 3.7.1). To avoid multicollinearity, a Variance Inflation Factor (VIF) analysis was conducted, with a threshold of 5, using the statsmodels package. BorutaShap was applied as a feature selection technique, using the BorutaShap package. Extreme Gradient Boosting, a tree-based machine learning technique, was applied both within BorutaShap and for predicting game outcome (win or loss), using the xgboost package. A 5-fold stratified cross-validation was used to validate the results. Cross-validation uses a large part of the data to fit the model, in this study 80%, and a small part of the data to test the model, in this study 20% [24]. Each part was thus used 4 times to fit the model and once for validation. Stratified K-fold was used to preserve the balance between the frequencies of each class of the dependent variable. The StratifiedKFold class and the cross_val_score function, both provided by scikit-learn, were used for the cross-validation. The average classification accuracy, precision, recall and F1-score across the cross-validation folds are reported, as well as the standard deviation.
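The VIF step can be sketched as follows. This is an illustration using NumPy only, not the statsmodels routine used in the study, and it rests on two assumptions: an intercept is included in each auxiliary regression, and variables are removed one at a time (largest VIF first) until all remaining VIFs are at or below the threshold of 5. The synthetic variables are hypothetical.

```python
import numpy as np


def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing column j on all other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)


def drop_collinear(X, names, threshold=5.0):
    """Iteratively drop the variable with the largest VIF until all VIFs <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names


# Synthetic example: two nearly identical shot variables and one unrelated variable.
rng = np.random.default_rng(0)
shots = rng.normal(size=200)
shots_on_target = shots + 0.1 * rng.normal(size=200)
distance = rng.normal(size=200)
X = np.column_stack([shots, shots_on_target, distance])

_, kept = drop_collinear(X, ["shots", "shots_on_target", "distance"])
print(kept)  # one of the two shot variables is removed; "distance" is kept
```

Which of two strongly correlated variables is removed depends on small differences in their VIFs, so a dropped variable is not necessarily uninformative.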
As the aim of this study was to identify the best predictive variables, a tree-based machine learning technique was chosen. Such techniques have the advantage of high interpretability and make it possible to apply a theoretically well-grounded method such as TreeExplainer to identify the strongest predictors [25]. TreeExplainer uses Shapley values to explain the global model structure by combining the local explanations of each prediction [25]. Using TreeExplainer, it is possible to determine the importance of each feature [26] on a global and local level. TreeExplainer was applied using the shap package. A graphical illustration of the workflow from the dataset to the identification of the best predictors is depicted in Figure 1.
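To make the idea of a local explanation concrete, the sketch below computes exact Shapley values by brute force for a single prediction of a toy scoring function. It is not the TreeExplainer algorithm, which obtains these values efficiently for tree ensembles; the toy model, the feature values and the all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are set to their baseline value.
    Cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(f):
    """Hypothetical score from [shots_diff, crosses_diff, home_game]."""
    shots_diff, crosses_diff, home_game = f
    return 0.8 * shots_diff - 0.3 * crosses_diff + 0.5 * shots_diff * home_game


x = [2, 4, 1]         # one game: +2 shots, +4 crosses, played at home
baseline = [0, 0, 0]  # reference game with no differences, played away

phi = shapley_values(toy_model, x, baseline)
print(phi)
# Local accuracy: the contributions sum to prediction minus baseline prediction.
print(sum(phi), toy_model(x) - toy_model(baseline))
```

Averaging the absolute contributions of each feature over all games gives a global importance ranking, while the signed values for one game form a local explanation.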

Results
After the removal of variables using VIF and the feature selection procedure using BorutaShap, a total of 13 variables were used during the modelling procedure. The model showed a predictive accuracy of 89.6% ± 3.1%, correctly classifying 516 out of 576 games that resulted in a win or loss (precision: 88.9%; recall: 90.1%; F1-score: 89.5%; Figure 2). In Table 2, the misclassifications are further specified in relation to total goal difference. The most important predictors of the model are presented in Figure 3. As an illustration of the possibilities of local explanations using TreeExplainer, two individual game predictions are presented in Figure 4a,b. As an example, this shows that positive differences between teams in total shots on target from the attacking penalty box are associated with winning, while negative differences are associated with losing.
Figure 4a: Game "Team C"-"Team D" (0-1), from the perspective of Team D (win predicted).
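The reported figures can be cross-checked with simple arithmetic, as in the sketch below. The F1-score is computed here from the reported average precision and recall; because the study reports averages over cross-validation folds, this relationship is only expected to hold approximately.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


correct, total = 516, 576          # correctly classified games / all games
accuracy = correct / total
misclassified = total - correct

precision, recall = 0.889, 0.901   # reported fold averages
f1 = f1_score(precision, recall)

print(f"accuracy      = {accuracy:.1%}")  # 89.6%
print(f"misclassified = {misclassified}")  # 60
print(f"F1-score      = {f1:.1%}")        # 89.5%
```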

Discussion
This study aimed to identify the strongest predictive variables of winning and losing in Belgian professional soccer. A broad spectrum of variables was used to build a predictive model. The results showed that more than 89% of the games resulting in a win or loss could be correctly classified. Total shots on target from the attacking penalty box proved to be the strongest predictor. Interestingly, a broad range of variables were among the most important variables related to game results, including physical indicators such as the distance covered in several speed zones, the number of accelerations (>2 m/s²) and the number of actions >15 km/h, as well as contextual variables such as ELO-rating, the total added transfer value of the benched players and match location.
There are different purposes for predicting game outcome. Previous studies on the prediction of game outcome in soccer often focused on betting [27]. These studies used historic games to construct a model, which was then evaluated by predicting future soccer games. In our study, however, the aim was to identify the strongest predictive variables, so instead of a division between historic and future games, a cross-validation approach was used to evaluate model performance. The prediction accuracy was considerably higher than in a similar study [28], which reported classification accuracies of 72.7% and 83.3% when predicting losing and winning, respectively, in professional soccer using artificial neural networks. A classification accuracy of over 89% can be considered high, as it has been shown that chance plays a major role in goal-scoring [29]. The results show that the majority of the misclassifications occur when the final goal difference between two teams is small. Given these small goal differences between teams, and the impact that each goal has on game outcome [29], "lucky winners" and "unlucky losers" occur frequently in low-scoring sports [30], and classification accuracies close to 100% seem improbable.
It was hypothesized that shot-related variables would be among the strongest predictors, as previous studies already showed that shot-related variables closely relate to game outcome in soccer [6,8,10,12]. In this study, the total shots on target from the attacking penalty box proved to be the best predictor of winning and losing. Of all shot-related variables, total shots on target from the attacking penalty box was the only variable that was not rejected by VIF or BorutaShap, showing that shot-related variables are closely interrelated. It should be noted that this does not directly indicate that other shot-related variables can be deemed unimportant, but that the information carried by those measures is already captured, or better captured, by other metrics in relation to game outcome. The number of shots on or off target is often reported by sports data providers, as opposed to shot location, which is usually not reported. It may therefore be useful to either use total shots on target from the penalty box, or use a metric such as Expected Goals, which also takes shot location into account. Other often-reported metrics such as the total number of passes and the number of successful passes are not among the most important predictors of game outcome, which may also be explained by their informative value to the model in comparison to other variables. This may also explain why Playing Styles-related possessions, such as Direct Play and Counter Attack, are amongst the best predictors, as they may provide more information to the model than total ball possession as an isolated metric.
Physical fitness is deemed an important factor relating to performance in soccer; however, the role of physical game output in relation to other performance indicators remained to be elucidated [31]. In our study, several physical indicators are shown to be among the best predictors of game outcome in soccer. All physical variables were subdivided into the first and second half, as it was previously shown that physical game performances, such as high-speed running [20], the number of accelerations and decelerations and the distance in several acceleration and deceleration zones, decrease at the end of the game [32]. Interestingly, most of the selected predictive physical variables relate to the second half, with the exception of the distance >25 km/h, of which both halves were included in the modelling procedure. This study also shows that the total difference in the number of medium accelerations (>2 m/s²) is associated with winning or losing, further confirming the importance of high-intensity efforts in soccer [33]. As indicated by the inclusion of the variable distance between 6-15 km/h in the second half, the ability to maintain the physical capabilities to perform not only high-intensity actions, but also low- to medium-intensity efforts throughout the game seems to be important. With regard to the interpretation of physical performance indicators, it is however important to note that "more" does not always indicate "better", as shown by the inverse relationship between the number of actions >15 km/h and game outcome. This finding is also partly supported by a previous study [34], which showed that players covered less distance in the zone between 17-21 km/h during games that were won.
As physical game output depends on a myriad of factors, such as ball possession [35], pacing strategy [36], match location and match status [37], physical game output, regardless of its expression, should be viewed in relation to other performance indicators [31,38] and contextual information [37].
Some variables that can be related to attacking play are negatively associated with game outcome. In accordance with previous research [39,40], higher frequencies of crosses are negatively associated with game outcome. Crossing, which can be defined as an airborne delivery of the ball into the opponent's penalty area [41], may therefore be labelled as an inefficient method to create good scoring opportunities [40,41]. It should, however, be noted that playing style depends on the qualities and characteristics of the team [11]. Therefore, the coaching staff may decide to apply a playing style that can generally be characterized as inefficient, because it matches the team's qualities and characteristics.
In a low-scoring game such as soccer, rare events are often those that lead to success [14]. These events should be captured to accurately predict game outcome, but they cannot always be properly quantified, even with more advanced metrics based on tracking data. Actions that are currently difficult to quantify, such as good positioning or the ability to give defense-splitting passes, are nevertheless often recognized by clubs, media and/or fans, resulting in higher transfer values reported by sources such as https://www.transfermarkt.co.uk, accessed on 5 March 2021. These actions also help teams to achieve better game outcomes, resulting in improved ELO-ratings. Including transfer values and ELO-ratings may thus be useful, possibly by partly filling the gap of what cannot (currently) be quantified, especially as these variables are feasible to obtain in practical and financial terms.
Machine learning was used in this study to identify the strongest predictive variables. It has previously been used in a soccer context not only in relation to game outcome [42][43][44] and tactics [45], but also in relation to training load [46,47] and injuries [48], showing the broad range of applications of machine learning in soccer. Given that game performance [49], training load [50] and injury [51] are all multidimensional, the application of machine learning can be useful, since it is particularly helpful when dealing with many input variables. Developments in the area of machine learning can be helpful in the translation from science to practice. One example is TreeExplainer [25], which is not only a theoretically well-grounded method to calculate feature importance [26], but also makes it possible to build visualisations that indicate the direction of the relation of a variable with performance. Illustrations such as those displayed in Figures 3 and 4 can be useful for analysts to show how features impact game outcome.
Future studies should attempt to add more detailed information for several features, for example, total distance in and out of possession or detailed information on the position of crosses. This information can be informative to the model and aid the explanation of results. As data is often provided by sports data companies, these companies should be encouraged to add more detail to the provided data to allow more thorough analyses. The use of "new" features should also be encouraged, to test their added value in relation to other, more established features. It should also be noted that the results from this study cannot fully distinguish whether a variable is the cause or the effect of a(n) (un)favourable scoreline. To illustrate, losing teams attempt to turn the game around and may therefore engage in more high-intensity efforts [34]. The winning side, on the other hand, may fall back, allowing less space for the losing side to play and perform sprints, while the losing team may apply a more risk-taking strategy, which could result in more counter-attacking play by the winning side [39,52]. Therefore, more research is necessary to gain more insight into the cause-effect relation between performance indicators and game outcome.
The results from our study show which variables can be considered the best predictors in an accurate model predicting winning and losing in professional Belgian soccer. They also provide the direction of the relationship of these variables with winning and losing, in relation to the other predictors. It was shown that not only shot-related variables, but a broad range of variables are amongst the strongest predictors of winning and losing. As the workflow from dataset to predictive modelling was also described in detail, similar approaches can be used to evaluate the current performance indicators provided to the coaching staff and other stakeholders connected to the team. It seems particularly interesting to look at physical parameters of the second half, given that they are amongst the best predictors of game outcome in soccer. Variables such as ELO-ratings, transfer values, match location and Playing Styles can be useful additions to current approaches used to evaluate game performances.

Data Availability Statement: Restrictions apply to the availability of these data. Part of the data was obtained from STATS Perform. Please contact Jan G. Bourgois (jan.bourgois@ugent.be) to inquire about data availability.