Multivariate Exploratory Comparative Analysis of LaLiga Teams: Principal Component Analysis

The use of principal component analysis (PCA) provides information about the main characteristics of teams, based on a set of indicators, instead of displaying individualized information for each of these indicators. In this work we have used PCA to reduce an extensive data matrix and improve its interpretation. Subsequently, with the new components and multiple linear regression, we have carried out a comparative analysis between the best and bottom teams of LaLiga. The sample consisted of the matches corresponding to the 2015/16, 2016/17 and 2017/18 seasons. The results showed that the best teams were characterized and differentiated from bottom teams by a greater number of successful passes and a greater number of dynamic offensive transitions. The bottom teams were characterized by executing more defensive than offensive actions, showing fewer goals and a greater ball possession time in the final third of the field. Goals, ball possession time in the final third of the field, number of effective shots and crosses are the main discriminating performance factors in football. This information allows us to increase knowledge about the key performance indicators (KPI) in football.


Introduction
The identification of performance factors, understood as variables that define some aspect of performance and that help achieve sports success [1], is essential to try to identify the most appropriate behavior patterns that can lead to success [2] and enable the increase and prediction of performance [3,4]. The analysis of the matches will identify those variables related to success [5], and the grouping and combination of these success indicators of different nature will allow the construction of football performance profiles [4,6]. To obtain both the indicators and the performance profiles, the discriminant analysis of the game between teams of different levels is a very useful tool. However, we are facing a sport of a complex and dynamic nature, which makes the identification of these performance profiles a very difficult task [7] because the success of the game can be associated with multiple factors (physical, technical, tactical, . . . ), some of them being unpredictable or uncontrollable, such as arbitration decisions, individual successes or failures of players, match location, type of competition or even chance.
Football research has turned to a multitude of performance indicators [8], and some studies have tried to identify them through the comparative analysis of successful and unsuccessful teams [9–17]. Some of these works show conflicting results. This may be caused, among other things, by the type and size of the sample, the study design, the selection of the variables and the characteristics of the sport itself. It may also be because most studies identify the success of the teams based on the match outcome [9,16,18–23]. This discrimination criterion can produce erroneous results because, in some matches, the team with the best statistical data does not end up winning, since in football there is no single winning style of play. Several teams with different play styles can get similar results. Therefore, it is necessary to classify the teams by their position at the end of the season instead of by the match outcome.
To study KPIs and performance profiles in football, it would be necessary to perform a nomothetic analysis instead of an idiographic one, as the latter would identify the behavior patterns of a single team and not of the game. It is necessary, therefore, to conduct a longitudinal analysis of all the teams and matches corresponding to one or several regular seasons and to classify the teams according to their final position and not based on the match outcome. In this way, the KPIs will be more reliable because they will be less mediated by the factors indicated above, and the higher performance (higher score) of some teams at the end of the season can be explained by the fact that they maintained a more effective behavior. Nevertheless, there are few previous studies in this line [11,24–28].
Sometimes, to carry out this type of works, especially when indirect observation methodology is used, we find a very extensive data matrix with many related variables. In this case, it would be beneficial to reduce this matrix for a simpler interpretation and eliminate possible redundant information. However, if the reduction is carried out under some subjective criteria, there is a risk of losing relevant information. Therefore, we need some tool that allows us to objectively reduce the dimensions of a data matrix without losing important information. For this, PCA can be an adequate statistical technique since its aims are to simplify, reduce and structure the initial information obtained [29]. Its application to the tactical analysis of football has been demonstrated in various works with satisfactory results. Specifically, Gómez et al. [30] carried out a study with the aim of identifying the independent and interactive effects of the game location and the final result in the statistics related to the football game according to the area of the field in which they occurred in LaLiga, from 2003 to 2004 and 2007 to 2008 seasons. They identified different profiles in the teams related to the match venue and the match outcome. In the work of Moura et al. [31] two main components were identified in the 2006 World Cup and showed that shots, shots on goal and percentage ball possession are some variables that discriminate among winning, drawing and losing teams. Winter and Pfeiffer [23] identified four dimensions in the UEFA Euro 2012 (game speed, transition play after ball recovery, transition play after ball loss and offense efficiency), concluding that the transition play after losing the ball and the offense efficiency seem to be factors connected directly with the match outcome, as those were important values for a successful discrimination. 
In [32], the specific aim was to investigate which factors were most crucial for the match outcome in Serie A, concluding that shots on target is the performance indicator of the game. In the work of Ric et al. [33], a comparative study of the individual and collective spatial organization of the players was carried out between the first and second half of the game. In the work of Fernández-Crehuet et al. [34] an index was built to measure the performance of Spanish football league teams during the 2016/2017 season, combining five dimensions: economic, fans-related, historical, team quality and the season's data. The authors in [35] managed to identify and differentiate four styles of play among the teams of the Chinese Soccer Super League during the 2006 season: possession, set pieces attack, counterattacking play and transitional play. However, we have not found previous works that have applied PCA to tactically analyze LaLiga teams over several seasons and that have determined the level of performance based on the position the teams occupied in the league table at the end of the season. Nor have previous works identified and used components to develop a performance model of teams of different levels. Consequently, we decided to carry out this study with the following aims. The first aim was to reduce the size of a large database and group it into new categories without losing relevant information, through PCA. The second aim was to perform a comparative and predictive performance analysis of the best and bottom teams of LaLiga, using the KPIs of each group.

Sample
In order to carry out this study, 1415 records corresponding to the 2015/16, 2016/17 and 2017/18 seasons of LaLiga have been analyzed. These records belong to the best and bottom teams, ranked according to their final score at the end of the regular league (best teams: the best six teams, qualified in the UEFA Champions League and the Europa League; bottom teams: the three lower teams that descended from a category and the next three worst ranked). Data were obtained from the analysis platform InstatScout and analyzed post event. Instatscout (www.instatscout.com) (accessed on 1 April 2019) is a private platform dedicated to assessing the performance of teams in different world leagues. The information cannot be considered either personal or intimate, as the research consisted solely of naturalistic observations in public places, and it was not anticipated that the recordings would be used in a manner that could cause personal harm. According to the Belmont Report [36], the use of public images for research purpose does not require informed consent or the approval of an ethical committee.

Data Reliability
To ensure the reliability of the data, five randomly selected matches were coded by the authors of this study and then compared with those provided by InStat. The Kappa (K) values obtained ranged from 0.92 to 0.97.

Procedure and Statistical Analysis
To analyze the game of both groups, a descriptive, comparative and predictive analysis was carried out on the performance variable "EFFECTIVENESS", defined as $\text{EFFECTIVENESS} = \frac{\text{Goals Scored} + \text{Shots on target}}{\text{Shots}}$. The analysis started by checking whether the set of 57 indicators used correctly differentiated the best and bottom teams. For this, a linear discriminant analysis (LDA) was applied, which is a well-established machine learning technique for classification into categories. Its main advantages are that the model is interpretable and that the prediction is simple.
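The classification check described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the data are synthetic, the function names `effectiveness`, `fit_lda` and `predict_lda` are hypothetical, equal priors are assumed for the two groups, and the EFFECTIVENESS ratio follows our reading of the definition (goals plus shots on target, divided by shots).

```python
import numpy as np

def effectiveness(goals, shots_on_target, shots):
    """Assumed reading of the definition: (goals + shots on target) / shots."""
    shots = np.asarray(shots, dtype=float)
    return (np.asarray(goals) + np.asarray(shots_on_target)) / shots

def fit_lda(X, y):
    """Two-class Fisher LDA; returns the weight vector w and the threshold c."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance matrix
    Sw = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    w = np.linalg.solve(Sw, m1 - m0)
    c = w @ (m0 + m1) / 2  # midpoint threshold (equal priors assumed)
    return w, c

def predict_lda(X, w, c):
    """Assign class 1 (best) when the discriminant score exceeds the threshold."""
    return (X @ w > c).astype(int)

# Synthetic stand-in for the indicator matrix (NOT the study's data)
rng = np.random.default_rng(0)
n, p = 200, 5
X_bottom = rng.normal(0.0, 1.0, size=(n, p))
X_best = rng.normal(1.0, 1.0, size=(n, p))
X = np.vstack([X_bottom, X_best])
y = np.repeat([0, 1], n)  # 0 = bottom teams, 1 = best teams

w, c = fit_lda(X, y)
accuracy = (predict_lda(X, w, c) == y).mean()
print(f"correctly classified: {accuracy:.2%}")
```

The share of correctly classified records printed at the end plays the same role as the percentage of well-classified records reported in the Results.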
Secondly, in each group of teams, a Principal Component Analysis (PCA) was carried out to reduce the set of indicators and work with a more manageable size, reducing the multicollinearity problem. This technique transforms the original information into a new set of variables, called principal components (PC); no information is lost as long as all the components are retained. The first PC captures the greatest amount of information, and each subsequent PC explains a progressively smaller amount. For the calculation of the principal components the variables were standardized (mean 0 and standard deviation 1). When calculated on standardized variables, the principal components are obtained from the eigenvectors of the correlation matrix, and as many principal components as there are variables can be obtained. The number of significant PCs was determined by the conventional criteria, but only those that cumulatively accounted for ≥70% of the explained variance were selected [29]. An additional benefit of PCA is that the PCs are uncorrelated with each other, so each one captures distinct information within the data set.
The indicators analyzed included, among others, shots, effectiveness of shots (goals and shots on target), attacks with shot, set pieces attacks, ball possession in own half and ball possession in opponent half.
Following this, the orthogonal Varimax rotation proposed by Kaiser [40] was performed to determine the contribution of the original categories to the variance explained by each extracted PC, as well as to improve interpretability. The original categories with PC loadings (PC_L; i.e., eigenvectors of the covariance matrix, which coincides with the correlation matrix for standardized variables) that exceeded ±0.70 were considered indicative of a well-defined relationship with the extracted PC [41][42][43]. The loadings can be interpreted as the weight/importance of each variable in each component; therefore, they help to know what type of information each of the components collects. The purpose of the calculation of the loadings is to identify the linear combinations that best represent the variables $X_1, \ldots, X_p$. Let $(Z_1, Z_2, \ldots, Z_M)$, where $M < p$, be linear combinations of the original $p$ variables, that is, $Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$, where $\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}$ are the constants, or loadings, of the $m$-th principal component (for example, $\phi_{11}$ would correspond to the first loading of the first principal component).
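The PCA steps described above (standardization, eigen-decomposition of the correlation matrix, the ≥70% cumulative variance criterion, Varimax rotation and the ±0.70 loading threshold) can be illustrated with the following sketch. It is an illustration under assumptions and not the authors' code: the data are synthetic, the function names are hypothetical, and the loadings are scaled by the square root of the eigenvalues so that each entry is the correlation between a variable and a component, which is one common convention for applying a 0.70 cut-off.

```python
import numpy as np

def pca_correlation(X):
    """PCA on standardized variables via the correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # sort from highest to lowest
    return Z, eigvals[order], eigvecs[:, order]

def n_components_for(eigvals, threshold=0.70):
    """Smallest number of PCs whose cumulative variance share reaches the threshold."""
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, threshold) + 1)

def varimax(L, max_iter=100, tol=1e-8):
    """Orthogonal Varimax rotation of a (variables x components) loading matrix."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        grad = Lr**3 - Lr @ np.diag((Lr**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ grad)
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):             # stop when the criterion stalls
            break
        d = d_new
    return L @ R

# Synthetic indicators driven by two hypothetical latent factors (NOT the study's data)
rng = np.random.default_rng(1)
n = 300
f = rng.normal(size=(n, 2))
X = np.hstack([f[:, [0]] + 0.3 * rng.normal(size=(n, 4)),
               f[:, [1]] + 0.3 * rng.normal(size=(n, 3))])

Z, eigvals, eigvecs = pca_correlation(X)
k = n_components_for(eigvals, 0.70)
scores = Z @ eigvecs[:, :k]                          # Z_m = sum_j phi_jm X_j
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])     # scaling is an assumption
rotated = varimax(loadings)
strong = np.abs(rotated) > 0.70                      # well-defined relationships
print(k, np.round(rotated, 2))
```

Because the rotation is orthogonal, it redistributes the loadings across the retained components without changing how much of each variable's variance they explain jointly.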
To finish, an analysis of variance (ANOVA) was carried out to check the differences between the principal components. By means of multiple linear regression, a prediction model was also constructed for each group, with "EFFECTIVENESS" as the explained variable and the principal components found in each group as explanatory variables.
This model will allow us to identify the set of variables that have the greatest influence on the performance of each group and to check whether there are differences between the styles of play of both groups of teams.
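A regression of this kind can be sketched as an ordinary least squares fit of the explained variable on the component scores. The example below is only an illustration: the scores and weights are synthetic and hypothetical (they are not the coefficients reported in this study), and `fit_linear_regression` is a name introduced here.

```python
import numpy as np

def fit_linear_regression(scores, y):
    """OLS fit of y on the PC scores, with an intercept as first coefficient."""
    A = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r2 = 1 - residuals.var() / y.var()
    return coef, residuals, r2

# Synthetic PC scores and hypothetical true weights (NOT the study's values)
rng = np.random.default_rng(2)
n = 500
scores = rng.normal(size=(n, 3))
true_coef = np.array([0.5, 0.8, -0.3, 0.05])   # intercept, PC1, PC2, PC3
y = true_coef[0] + scores @ true_coef[1:] + 0.1 * rng.normal(size=n)

coef, residuals, r2 = fit_linear_regression(scores, y)
# Order the PCs by the absolute size of their weight, largest first
ranking = np.argsort(-np.abs(coef[1:]))
print(np.round(coef, 3), ranking, round(r2, 3))
```

Since the PC scores are uncorrelated, the weight of each component can be read on its own, which is what makes ordering the components by weight meaningful.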

Results
The linear discriminant analysis (LDA) classified 85.63% of the records correctly and 14.36% incorrectly, which reveals that the indicators used correctly classified the teams as best and bottom. For the best teams, Figure 1 shows the scree plot, the eigenvalues and the cumulative variance explained by each PC. The first eight components produced by the PCA explained 70.1% of the total variance, and their eigenvalues were higher than 1 (see Table 2). The component loadings after rotation are shown in Table 3. Applying the Varimax rotation method maximizes the variance of the loading matrix so that the values are more interpretable. The "rotated" matrix provides the loadings of the principal components, and each column contains the vector of loadings for one principal component.

Bottom Teams PCA Results
In the bottom teams, a scree plot was generated to represent the eigenvalues ordered from highest to lowest (Figure 3). Taking into account the cumulative variance and the eigenvalues, the first nine components, which explain 70% of the variance, were used (Table 4). Table 5 shows the bottom teams' component loadings after rotation.

Best Teams ANOVA and Linear Regression Model
An analysis of variance indicated that there were significant differences in the eight main components (Table 6). It was verified that the residuals were distributed randomly around 0. The Shapiro-Wilk test was performed to check the normality of the residuals, and a p-value = 0.337 was obtained; therefore, normality can be assumed. For the homoscedasticity of the residuals, the Breusch-Pagan test was used, with BP = 1.8871, df = 2 and a p-value = 0.389, so there is no evidence of a lack of homoscedasticity.
Autocorrelation was studied by means of the Durbin-Watson (DW) statistic, which gave a value of 1.879, with a p-value = 0.959. Therefore, there is no evidence of autocorrelation. The eight main components were entered into a linear regression model (Table 7) to predict the explained variable "EFFECTIVENESS". It can be seen how all the PCs provided different information to the model, with PC3 being the one with the greatest weight (0.76260) and PC1 the one with the lowest weight (0.03851).
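Two of the residual checks mentioned above can be sketched as follows. This is a hedged illustration, not the procedure actually run by the authors: the residuals are synthetic, the function names are hypothetical, and the Breusch-Pagan statistic is computed in its studentized (Koenker) form, which is an assumption about the variant used. The p-value helper is valid only for a chi-square distribution with exactly 2 degrees of freedom, for which the survival function is exp(−x/2).

```python
import math
import numpy as np

def durbin_watson(residuals):
    """DW statistic: sum of squared successive differences over sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def breusch_pagan_lm(residuals, X):
    """Studentized (Koenker) LM statistic: n * R^2 of e^2 regressed on X."""
    e2 = np.asarray(residuals, dtype=float) ** 2
    A = np.column_stack([np.ones(len(e2)), X])
    coef, *_ = np.linalg.lstsq(A, e2, rcond=None)
    fitted = A @ coef
    r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return float(len(e2) * r2)

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square with exactly 2 df."""
    return math.exp(-x / 2)

# Synthetic regressors and residuals (hypothetical, independent, constant variance)
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
residuals = rng.normal(size=400)

dw = durbin_watson(residuals)        # values near 2 suggest no autocorrelation
bp = breusch_pagan_lm(residuals, X)
p_bp = chi2_sf_df2(bp)
p_reported = chi2_sf_df2(1.8871)     # BP value reported in the text, df = 2
print(round(dw, 3), round(bp, 3), round(p_bp, 3), round(p_reported, 3))
```

The DW statistic is bounded between 0 and 4, with values close to 2 indicating no first-order autocorrelation. With 2 degrees of freedom, the reported BP = 1.8871 corresponds to an upper-tail probability of about 0.389, in agreement with the p-value given in the text.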

Bottom Teams ANOVA and Linear Regression Model
In Table 8 we can see that significant differences have been identified among the nine main components of the bottom teams. Residuals must be distributed randomly around 0. The Shapiro-Wilk test confirmed the normality of the residuals (p-value = 0.262). The Breusch-Pagan test values were BP = 1.8871, df = 2, and a p-value = 0.389, from which it follows that there was no evidence of a lack of homoscedasticity. The Durbin-Watson statistic (DW = 1.989, p-value = 0.972) showed no evidence of autocorrelation.

Discussion
To identify the indicators that influence football performance we perform a comparative analysis between teams of different levels of success, but sometimes we find a set of data with many related categories; therefore, the application of techniques that reduce the quantity of data could be useful. In this work we have considered reducing the dimensions of a data matrix without the loss of relevant information, using PCA. Subsequently we have used these PCs to try to identify the difference in performance between the best and bottom teams of LaLiga.
In both groups, the Passes PC is so named because most of the categories that constitute it refer to the number of passes and the time of possession. The Challenges PC received this name because it included all types of challenges. The Attack effectiveness PC collected categories of the offensive phase, especially those related to goals, shots and the effectiveness of shots. The Shots PC mainly included categories related to goals, shots, possession and passes. The Dribbles PC was mainly constituted by categories referring to dribbling, tackles and challenges. The Tackles PC was related to dribbling, challenges, tackles and lost balls. The Offensive transitions PC received this name for being related to recoveries, interceptions and counterattacks. The Possession PC, in the group of successful teams, is the least well defined, since it is made up of categories that are less related to each other. In the bottom teams the 1vs1 PC included all dribbles and tackles. The Fouls opponent PC is constituted by varied categories, the one with the greatest weight being fouls opponent, and, finally, the Possession PC is also formed by different categories, the time of possession being the most important. Therefore, as in some previous works [23,31,33–35,45–47], PCA proved to be a good statistical technique for reducing large data sets with many interrelated variables, allowing us to speak not only of individual performance indicators but of sets of related indicators.
If we use the PCs to compare the game of both groups, the first difference we observe is that, to explain the same percentage of variance, for the best teams group we need eight PCs, and for the bottom teams we need nine PCs. In both groups, both the category constituted from PC and called Passes, as well as Challenges, were those that allowed explaining the highest percentage of the variance. The Passes category had a slightly greater weight (27%) in the best teams group than in the bottom teams (23%) (Tables 2 and 4). On the other hand, Challenges showed a similar weight in both groups (15% and 14%). However, the loadings of each PC were not exactly the same for each group (Tables 3 and 5). Thus, for Passes PC in the best teams group, the most important categories were passes, passes accurate, passes accurate left and passes accurate right. For the bottom teams, the highest weight categories were possession, passes, passes forward, passes left, passes right and passes forward accurate. Therefore, we can indicate that successful teams are characterized more by the efficiency of the passes than by the number of passes executed. That is, they have a greater number of successful passes than lower level teams. These results coincide with some previous works [22], but they analyzed 2014 Brazil FIFA World Cup and used a logistic regression. For Challenges PC we have also found some differences. It can be seen how, for the best teams, the attack challenges had greater weight; however, the defensive challenges were the ones most relevant for bottom teams. This circumstance can be explained because the bottom teams are characterized by staying longer in the defensive phase, executing many more defensive than offensive actions. Previous work also coincides in indicating that the successful teams show higher averages of offensive variables, and unsuccessful teams show higher averages of defensive variables [48].
Another difference that we can see in terms of PC formation is that in the best teams the PC called Tackles is formed, consisting mainly of the categories dribbling, challenges, tackles and lost balls. In the bottom teams the 1vs1 PC and fouls were constituted but did not appear in the other group. In spite of these differences we can appreciate that both the components constituted for both groups, as well as the categories and the weight of these in each component, were very similar. This circumstance leads us to think that in high level football the differences between the teams are minimal, and their success or failure may be explained by the individual performance of their players.
The results of the linear regression model (Tables 5 and 7) allow us to identify which PCs have the greatest influence on the performance of both groups of teams. For this, a prediction model of the category "EFFECTIVENESS" was built, both for the best and for the bottom teams. The linear regression model of the best teams group, ordering the PCs from highest to lowest weight, was constituted as follows: Attack effectiveness (0.76260); Offensive transitions (0.40160); Shots (0.36481); Possession (0.33451); Dribbles (0.22498); Passes (0.03851); Challenges (−0.23416) and Tackles (−0.40160). In the bottom teams the order was as follows: Dribbles (0.55955); Possession (0.51367); Shots (0.30873); Passes (0.17582); Challenges (0.07051); 1vs1 (−0.18340); Fouls (−0.54061) and Attack effectiveness (−0.84295). We can see that, in the best teams, the PC with the greatest influence on the prediction of this category was Attack effectiveness. The number of goals, a greater ball possession time in the final third of the field, and a greater number of effective shots and crosses allow an increase in performance in the best teams. This information is essential for coaches since, if they manage to improve the performance of their teams in these elements of the game, they will increase their offensive performance. The information provided by the number of goals is trivial, since it is obvious that scoring more goals implies increasing offensive performance, but the other indicators, referring to ball possession zone, effective shots and crosses, do offer relevant information. These results are corroborated by the works of [9,10], who indicated that successful teams have longer possessions in the offensive half of the field than in the defensive one. The works [19,22,49,50], which indicate that successful teams show greater effectiveness in shooting, also confirm that making a greater number of crosses increases the chances of winning matches.
Unlike the cited studies, we have obtained similar results using a different method, specifically a data mining technique. Winter and Pfeiffer [23] also reached the same conclusion in their work, indicating that there is a relationship between offense efficiency and success, but they analyzed UEFA Euro 2012 and considered success as the match outcome.
Following the results of the linear regression model, we can indicate how the main differences in the prediction of performance of both groups occur in PCs offensive transitions, tackles, challenges, dribbles, fouls opponent and 1vs1. Offensive transitions play a more important role in the best teams than in the bottom teams. Thus, in the best teams, performing a greater number of recoveries, interceptions and counterattacks, that is, dynamic offensive transitions through counterattacks, would increase their performance in the game. This circumstance was also pointed out by Tenga et al. [40]. These authors analyzed the Norwegian league, and by means of a multiple linear regression, they obtained that the proportion of goals scored during counterattacks (52%) was higher than during elaborate attacks (48%). Therefore, the offensive game seems to be more efficient against a disorderly defense. This information is very important for the coaches, who should focus their training on these game situations, both in attack and defense, to try to improve their performance in both phases of the game.
In the best teams the Tackles and Challenges PCs negatively influenced the offensive performance. This may be due to the fact that these are more typical behaviors of unsuccessful teams, as indicated above [48].
In bottom teams, increasing the number of successful dribbles would increase performance. This result coincides with the work of Harrop and Nevill [21], who found that the number of dribbles is correlated with performance. The Fouls opponent PC also showed a strong negative influence on the performance of bottom teams, and these teams showed fewer effective attacks than the best teams.
We have achieved the aims set, and the sample used, which comprises the matches of three competitive seasons, allows us to generalize the results. The main contribution and novelty of this work is that we have carried out a longitudinal tactical analysis of LaLiga teams, using the combination of principal component analysis and linear regression. However, we believe that the differences found in the constitution of the different PCs have not been as satisfactory as we would have liked. We believe that this may be due to the design used: in our case we found the PCs for each group of teams separately and, subsequently, tried to build a prediction model with the detected PCs. In future works we should propose a design in which we find the main components for both groups together and then build a separate model for each group. In addition, since the goals scored and received did not make a significant contribution to the main components, in the future it could be considered to eliminate these variables from the analysis because they may be biasing it.
The results of this work offer coaches information about the KPIs in football and the game pattern of the best teams, which they can compare with that of their own teams and thus make the appropriate modifications to increase performance.

Conclusions
This work has allowed us, with the use of PCA, to reduce the dimensionality of the data without losing relevant information. We have been able to identify the KPIs of the best and bottom teams, and we have identified the main differences between both groups.
Best teams are characterized and differentiated from bottom teams in the realization of a greater number of successful passes and in the execution of a greater number of dynamic offensive transitions.