A Comparative Study of Logistic Models Using an Asymmetric Link: Modelling the Away Victories in Football

The aim of this paper is to study the relevant factors affecting the away victories of football teams in order to fit the probability of winning an away match. The paper addresses the following research issues: (a) Is the identification of the significant variables underlying the results plausible? (b) Can information on these factors increase the probability of winning away from home and assist coaches in their decisions? Empirically, there are more home victories and draws than away victories in the professional football leagues in Europe, and this fact has to be taken into account. Thus, the classical and Bayesian logistic regression models do not seem to be adequate in this case, and an asymmetric logistic regression model is therefore considered. This paper analyses 380 games played in the First Division of the Spanish Football League during the 2013–2014 season. Asymmetric logistic regression from a Bayesian point of view is chosen as the best model. This model detects new relevant factors undetected by standard logistic regressions. In view of the paper’s findings, various practical recommendations are made in order to improve decision-making in this field. The asymmetric logit link is a helpful device that can assist coaches in their game strategies.


Introduction
In the middle of the 1990s, most European football leagues replaced the old points system (two points for a victory and one point for a draw) with a new one (three points for a victory and one point for a draw). English league football was the first to adopt this system, in 1981 (see [1,2] for details). The new points system was applied not only in the first division but also in the remaining categories of football competition in all countries. The new rule was introduced in the World Cup and the European qualifying in 1994 and one year later in Spain and the Champions League. In the words of [3], the motivation behind the change was to avoid boring draws.
Several works on the effects of the transition from the 2-1-0 to the 3-1-0 award system in football have been published in the last few decades; see, for example, [4,5,6,7], among others. The consequences of the new points system are not clear but, at least in Spain, most teams now play for the victory not only at home but also in away games. In the past, teams playing away were satisfied with a draw, at least in Spain and Italy. These days, most teams focus on getting three points from the match, because the difference between a victory and a draw is two points instead of one, as it was in the past. In the long term, a large number of draws would leave the team in the lowest positions of the classification and, therefore, the probability of avoiding relegation decreases. Figure 1 shows the away victories in four of the most important European football leagues (Premier League, Bundesliga, Italian Football League and Spanish Football League) from the 1993-1994 to the 2015-2016 seasons. There is a growing trend in away victories from 1993 onwards. Therefore, it seems important to play for the victory, instead of playing for a draw, even when teams play as visitors. Accordingly, the target of this paper is to study the relevant factors affecting the away victories of football teams in order to fit the probability of winning an away match. To our knowledge, no preceding studies have considered the situation in which matches have many more home victories and draws than away victories. A classical logit model can be used to analyse the factors that determine sporting achievement, but sometimes the individual results fall much more often into one category than into the other.
This is the case studied in this paper, in which home victories and draws are much more frequent than away victories in the final results of the games; therefore, the asymmetric logit model can improve the estimations. In this context, [8] proposed a Bayesian procedure with a skewed link for the analysis of binary response data when one response is much more frequent than the other. Similarly, [9] used a skewed logit link to estimate the fraudulent conduct reflected in a Spanish database of insurance claims, [10] applied the asymmetric logit model to analyse infection rates in a General and Digestive Surgery hospital department, and [11] studied the risk variables underlying automobile insurance claims, taking into account the asymmetry of the database.
The formal aspects of the different logistic regression models considered in this work are developed in Section 2. The description of the database is shown in Section 3. Section 4 discusses the results, and conclusions and future lines of research connected with this work are presented in the last Section.

Frequentist Estimation
When research deals with binary outcomes, the logit and probit models are the most popular choices. A binary response model is a regression model in which the dependent variable y is a binary random variable that takes only the values zero and one. In our case, y = 1 if a match ends with an away victory and y = 0 otherwise, that is, if the match ends with a draw or a defeat for the visiting team. In this article, we use the logit model to estimate the probability of an away victory in a football competition given a set of characteristics of the event; that is, given the predictor X, we estimate Pr(y = 1|X = x), the conditional probability that y = 1 given the value of the predictor. As is known, the logit specification is a particular instance of a generalized linear model (see [12], chapter 12, for details). Moreover, the logistic link function is a relatively simple transformation of the prediction curve and also yields odds ratios. Both characteristics make it more popular among researchers than probit regression. The standard logistic distribution has a closed-form distribution function and a shape notably similar to that of the normal distribution. Logit models have been used widely in several fields, including medicine, biology, psychology, economics, insurance and politics. Recent applications of the linear logit specification to sports statistics are [13,14] in basketball and [15,16] in football, among others.
Specifically, the logit model is defined as follows. For observation t in a sample of size n, let $y_t$, $t = 1, 2, \ldots, n$, be a binary variable taking the value 1 with probability $p_t$ and 0 with probability $1 - p_t$. The regression is modelled by assuming that $p_t = F(x_t\beta)$, where $\beta = (\beta_1, \ldots, \beta_k)$ is a $k \times 1$ vector of regression coefficients, which represent the effect of each variable in the model and have to be estimated, and $x_t = (x_{t1}, \ldots, x_{tk})$ is a vector of known constants (explanatory variables), including an intercept; in our case, it is the vector of covariates for match t. Here, F is the standard logistic cumulative distribution function, i.e., the inverse of the logit link function. Recall that the probability density function of the standard logistic distribution is symmetric about 0. In summary, the logit specification adopts the following form:

$$y_t \sim \text{Bernoulli}(p_t), \qquad p_t = F(x_t\beta), \qquad t = 1, \ldots, n. \tag{1}$$

Thus, the likelihood is given by

$$\ell(y \mid x, \beta) = \prod_{t=1}^{n} \left[F(x_t\beta)\right]^{y_t} \left[1 - F(x_t\beta)\right]^{1 - y_t}, \tag{2}$$

where $F(s) = 1/(1 + e^{-s})$, $-\infty < s < \infty$, is symmetric with respect to zero in the sense that $F(-s) = 1 - F(s)$. The β parameters are usually estimated by the maximum likelihood method. In this way, the model gives the probability of each visiting team winning. The next step is to choose a cut-off for determining whether a match is classified as an away victory or not. The classical logit model (frequentist approach) is implemented in most standard statistical packages, such as Mathematica (Champaign, IL, USA), STATA (College Station, TX, USA) and R (Vienna, Austria), among others. We have estimated the basic logit model using the STATA 14.1 econometric software.
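As an illustration of the steps just described (logistic distribution function, Bernoulli likelihood, maximum likelihood estimation and classification with a cut-off), the following minimal Python sketch may be helpful. It is not the STATA procedure used in the paper: the toy data, the gradient-ascent optimiser and the function names (`logistic_cdf`, `log_likelihood`, `fit_logit`) are assumptions introduced only for this example, and the cut-off is taken to be the sample frequency of y = 1.

```python
import math

def logistic_cdf(s):
    """Standard logistic CDF F(s) = 1 / (1 + exp(-s)), arranged to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def linear_predictor(beta, x_t):
    """Inner product x_t * beta for one observation."""
    return sum(b * v for b, v in zip(beta, x_t))

def log_likelihood(beta, X, y):
    """Log of the Bernoulli likelihood with p_t = F(x_t beta)."""
    total = 0.0
    for x_t, y_t in zip(X, y):
        p = logistic_cdf(linear_predictor(beta, x_t))
        total += math.log(p) if y_t == 1 else math.log(1.0 - p)
    return total

def fit_logit(X, y, steps=3000, rate=0.1):
    """Approximate the ML estimate by gradient ascent (an illustrative optimiser)."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x_t, y_t in zip(X, y):
            residual = y_t - logistic_cdf(linear_predictor(beta, x_t))
            for j, v in enumerate(x_t):
                grad[j] += residual * v
        beta = [b + rate * g / len(y) for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: an intercept and one rescaled covariate per match.
X = [[1.0, 0.2], [1.0, 0.4], [1.0, 0.5], [1.0, 0.7],
     [1.0, 0.8], [1.0, 1.0], [1.0, 1.1], [1.0, 1.3]]
y = [0, 0, 1, 0, 0, 1, 1, 1]

beta_hat = fit_logit(X, y)
cutoff = sum(y) / len(y)  # sample frequency of y = 1, used here as the cut-off
fitted = [logistic_cdf(linear_predictor(beta_hat, x_t)) for x_t in X]
predicted = [1 if p > cutoff else 0 for p in fitted]
print(beta_hat, predicted)
```

In practice, a statistical package would use Newton-type iterations and report standard errors; gradient ascent is used here only because it keeps the sketch short and dependency-free.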

Bayesian Estimation
In contrast to the frequentist approach, the Bayesian approach has gained a lot of popularity in recent decades. In the past, the main motivation for using the standard logit regression model was its low computational effort. Software implementing other methodologies has become widely available in the last few decades owing to advances in computational science. Since the pioneering work of [17] (first published in 1971), the applications of Bayesian methodology in econometric theory have increased considerably.
In the Bayesian approach, the β parameters are treated as random variables. Here, they are assigned non-informative normal prior distributions centred at zero, which makes comparisons with the classical results easy. Bayesian methods combine the data with prior knowledge to obtain the estimates, and these results are often more accurate than those derived under classical methods.
Bayesian inference for logit models follows the standard mechanism of Bayesian analysis: the likelihood function of the data and the prior distribution over the unknown parameters are combined through Bayes' theorem to compute the posterior distribution of the parameters.
The set of unknown parameters is represented by the vector $\beta = (\beta_1, \ldots, \beta_k)$. Thus, the Bayesian logit model can be specified as follows:

$$y_t \sim \text{Bernoulli}(F(x_t\beta)), \qquad t = 1, \ldots, n, \tag{3}$$

$$\beta \sim \pi(\beta), \tag{4}$$

where π(·) is the prior distribution of β. The selection of the prior can involve informative prior distributions, if the researcher knows something about the parameters, or non-informative priors, if there is little information about these coefficients. A problem arises when informative prior distributions are chosen: the information must be given on the logit scale, i.e., on the β parameters directly. We suppose, as is usual, that the parameters of the logit model follow a normal distribution, $\beta_j \sim N(\mu, \sigma^2)$, $j = 1, \ldots, k$, where µ is zero and σ is chosen large enough for the prior to be considered non-informative.
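By combining the prior assumption with the likelihood in (2), we obtain the posterior distribution for the parameters β, which is proportional to

$$\pi(\beta \mid y, x) \propto \pi(\beta) \prod_{t=1}^{n} \left[F(x_t\beta)\right]^{y_t} \left[1 - F(x_t\beta)\right]^{1 - y_t}.$$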
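To make the combination of likelihood and prior concrete, the sketch below evaluates the log-posterior of a Bayesian logit model with independent N(0, σ²) priors and draws from it with a random-walk Metropolis sampler. This is only an illustration under stated assumptions: the sampler is a simple stand-in chosen for brevity and is not the WinBUGS procedure, and the toy data, step size and function names are hypothetical. The prior variance of 10^8 follows the non-informative choice described in the text.

```python
import math
import random

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def log_posterior(beta, X, y, sigma2=1e8):
    """Log-likelihood plus independent N(0, sigma2) log-priors, up to a constant."""
    value = -sum(b * b for b in beta) / (2.0 * sigma2)
    for x_t, y_t in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, x_t))
        # log F(eta) = -softplus(-eta); log(1 - F(eta)) = -softplus(eta)
        value += -softplus(-eta) if y_t == 1 else -softplus(eta)
    return value

def metropolis(X, y, n_iter=20000, burn_in=5000, step=0.8, seed=1):
    """Random-walk Metropolis draws from the posterior of beta."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y)
    draws = []
    for i in range(n_iter):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
        if i >= burn_in:
            draws.append(list(beta))
    return draws

# Hypothetical toy data: an intercept and one rescaled covariate per match.
X = [[1.0, 0.2], [1.0, 0.4], [1.0, 0.5], [1.0, 0.7],
     [1.0, 0.8], [1.0, 1.0], [1.0, 1.1], [1.0, 1.3]]
y = [0, 0, 1, 0, 0, 1, 1, 1]

draws = metropolis(X, y)
posterior_mean = [sum(d[j] for d in draws) / len(draws) for j in range(len(X[0]))]
print(posterior_mean)
```

With such a diffuse prior the posterior is dominated by the likelihood, which is why non-informative Bayesian estimates are expected to resemble the maximum likelihood ones.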

Bayesian Asymmetric Estimation
The use of a symmetric link function, as in the frequentist and Bayesian logit specifications above, is recommended for binary response data in which the frequencies of both responses are similar. If one response is much more frequent than the other, an asymmetric link is preferable. Figure 2 shows the home victories and draws versus the away victories in the four most important European football leagues from the 2012-2013 to the 2015-2016 seasons. The 0 response (home victory or draw) is much more frequent than the 1 response (away victory), and therefore an asymmetric link function is preferable in order to explain the conditional probability Pr(y = 1|X = x). In this case, application of the above classical models can lead to model misspecification, misinterpretation of the marginal effects and unidentified predictors. A commonly adopted asymmetric link function is the complementary log-log link, which has a fixed negative skewness and therefore cannot incorporate positive skewness. Several attempts to overcome this problem appear in the statistical literature; see [8,19,20], among others.
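The difference between a symmetric and an asymmetric link can be checked numerically. The short sketch below compares the logistic response function with the response function implied by the complementary log-log link, 1 − exp(−exp(s)); the function names are chosen only for this example. A symmetric link satisfies F(s) + F(−s) = 1 for every s, whereas the complementary log-log response does not.

```python
import math

def logistic_response(s):
    """Response function of the logit link: F(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def cloglog_response(s):
    """Response function of the complementary log-log link: 1 - exp(-exp(s))."""
    return 1.0 - math.exp(-math.exp(s))

for s in (0.5, 1.0, 2.0):
    logistic_sum = logistic_response(s) + logistic_response(-s)
    cloglog_sum = cloglog_response(s) + cloglog_response(-s)
    # The logistic sum equals 1 (symmetry about zero); the cloglog sum does not.
    print(s, logistic_sum, cloglog_sum)
```

Because the shape of the complementary log-log curve is fixed, its skewness cannot be adapted to the data, which is the limitation the skewed-link proposals try to address.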
The model proposed by [20] includes the complementary log-log link and the probit models. However, Stukel's models yield improper posterior distributions under an improper uniform prior for β (see [8] for details). From the asymmetric point of view, [8,21] considered a procedure based on data augmentation, introducing latent variables $z_t$ and $\varepsilon_t$ and supposing that

$$y_t = \begin{cases} 1, & w_t \geq 0, \\ 0, & w_t < 0, \end{cases} \qquad w_t = x_t\beta + \delta z_t + \varepsilon_t,$$

where $\delta \in (-\infty, \infty)$ is the skewness coefficient, so that the asymmetry of the logistic model is captured by $\delta z_t$. If δ > 0, the probability that $y_t = 1$, i.e., that the tth match ends with an away victory, increases. On the other hand, if δ < 0, the probability of the match ending with a draw or a defeat of the visiting team increases.
The new Bayesian asymmetric logit model can be written as follows:

$$y_t \mid z_t \sim \text{Bernoulli}(F(x_t\beta + \delta z_t)), \qquad z_t \sim G, \qquad t = 1, \ldots, n, \tag{6}$$

$$(\beta, \delta) \sim \pi(\beta, \delta), \tag{7}$$

where π(β, δ) is a joint prior distribution for (β, δ). The symmetric logistic model (3)-(4) is just a particular case of model (6)-(7) when there is no skewness (δ = 0). We assume that $z_t$ and $\varepsilon_t$ are independent and that F and G are the standard logistic and half-standard normal cumulative distribution functions, respectively. The density of the latter is $g(z) = \sqrt{2/\pi}\,\exp(-z^2/2)$, $z > 0$. The likelihood function is given by

$$\ell(y \mid x, \beta, \delta) = \prod_{t=1}^{n} \int_0^{\infty} \left[F(x_t\beta + \delta z_t)\right]^{y_t} \left[1 - F(x_t\beta + \delta z_t)\right]^{1 - y_t} g(z_t)\, dz_t.$$

Again, we assume that the prior distributions of the parameters are normal and non-informative. Thus, $\beta_j \sim N(0, \sigma_j^2)$, $j = 1, \ldots, k$, and $\delta \sim N(0, \sigma_\delta^2)$, where $\sigma_j > 0$, $j = 1, \ldots, k$, and $\sigma_\delta > 0$ are sufficiently large, reflecting the absence of prior knowledge about the parameters of interest and facilitating the comparison with the classical model. The values of the variances considered are $\sigma_j^2 = 10^8$, $j = 1, \ldots, k$, and $\sigma_\delta^2 = 10^8$. The posterior distribution for the β and δ parameters is proportional to

$$\pi(\beta, \delta \mid y, x) \propto \ell(y \mid x, \beta, \delta)\, \pi(\beta, \delta).$$

Again, we use WinBUGS to approximate the properties of the marginal posterior distributions of each parameter.
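The role of the skewness coefficient can be illustrated numerically. The sketch below assumes that, conditionally on a half-normal latent variable z, the probability of an away victory has the form F(η + δz), with η standing for the linear predictor, and it integrates z out with a simple midpoint rule. The function names, the integration range and the grid size are assumptions made for this example; the paper itself relies on WinBUGS rather than on numerical quadrature.

```python
import math

def logistic_cdf(s):
    """Standard logistic CDF F(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def half_normal_pdf(z):
    """Half-standard normal density g(z) = sqrt(2/pi) * exp(-z^2 / 2), z > 0."""
    return math.sqrt(2.0 / math.pi) * math.exp(-z * z / 2.0)

def skewed_probability(eta, delta, upper=8.0, n=4000):
    """Midpoint-rule approximation of the integral of F(eta + delta*z) g(z) over z > 0."""
    h = upper / n
    total = 0.0
    for i in range(n):
        z = (i + 0.5) * h
        total += logistic_cdf(eta + delta * z) * half_normal_pdf(z)
    return total * h

eta = -1.0  # hypothetical value of the linear predictor
for delta in (-1.0, 0.0, 1.0):
    print(delta, skewed_probability(eta, delta), logistic_cdf(eta))
```

With δ = 0 the integral reduces to F(η), the symmetric logit; a positive δ raises the probability of y = 1 and a negative δ lowers it, in line with the interpretation of δ given in the text.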

Description of Database
This paper analyses 380 matches played in the First Division of the Spanish Football League, La Liga, during the 2013-2014 season in order to identify the factors that might have affected the probability of winning an away match. We consider four sets of variables: game statistics (HS, AS, AF, HC, AC, HY, AY, HR and AR), a game variable we term DERBY, non-sports variables (BUDH and BUDA) and referee variables (INTERNATIONAL and ACIENT). This dataset and others may be downloaded from [22]. These variables were selected from 262,144 competing models by applying the Bayesian model averaging (BMA) tool, after testing for the absence of collinearity with the variance inflation factor (VIF) criterion.
The variables included in the game statistics category were HS and AS, the total shots of the home and visiting teams, respectively; AF, the fouls committed by the visiting team; HC and AC, the number of corners for each team; and, finally, the yellow and red cards shown to the home and visiting teams, HY, AY, HR and AR. There is one game variable, DERBY, which takes the value 1 when the match is played between teams from the same region or city, or between the strongest teams in the competition, and 0 otherwise. The non-sports variables, BUDH and BUDA, represent the budgets of the home and visiting teams. Finally, the variables related to the referee are INTERNATIONAL, scored as 1 if the referee had international experience and 0 otherwise, and ACIENT, the number of years of experience in the first division.
A brief description of these variables is shown in Table 1.

Table 1
HR: Red cards shown to the home team.
AR: Red cards shown to the away team.
Game variable
DERBY: Match played between teams from the same city or region or between the strongest teams in the league.

Empirical Results
In this section, we check that the non-informative Bayesian symmetric and the frequentist estimations of the logistic model provide similar results in terms of fit and coefficient estimates. Then, we compare these estimations with those obtained by the Bayesian asymmetric logistic model and we observe that this last model improves the overall fitting and detects new relevant variables.
To evaluate the quality of fit, we propose three different measures: (i) the percentage of correct classifications calculated from the estimated probabilities; (ii) the Akaike information criterion (AIC), defined as $\text{AIC} = 2(k - \log \ell(y \mid x, \hat{\beta}))$; and (iii) the deviance information criterion (DIC), given by $\text{DIC} = -2 \log \ell(y \mid x, \hat{\beta})$. Here, $\hat{\beta}$ denotes the estimated parameters, usually obtained by maximum likelihood estimation, and k is the number of estimated parameters. Both statistics measure the relative quality of statistical models for a given set of data. The idea is that models with smaller AIC and DIC should be preferred to models with larger AIC and DIC. See [23,24] for details. We estimate the above-mentioned probability for match t as $\hat{p}_t = F(x_t\hat{\beta})$ for the Bayesian symmetric and the frequentist logistic models, and as $\hat{p}_t = F(x_t\hat{\beta} + \hat{\delta}\hat{z}_t)$ for the Bayesian asymmetric logistic model. The posterior distributions for the Bayesian models were simulated using WinBUGS. A total of 500,000 iterations were carried out (after a burn-in period of 100,000 simulations). Three different chains were run, and convergence was evaluated for all parameters using tests provided within the WinBUGS Convergence Diagnostics and Output Analysis (CODA) software. The source codes of the Bayesian estimations are available upon request from the authors.
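The two likelihood-based criteria can be computed directly from fitted probabilities. The following sketch uses the formulas as stated in the text, AIC = 2(k − log ℓ) and the quantity −2 log ℓ reported as DIC; the observed outcomes, fitted probabilities and number of parameters are hypothetical values chosen only to show the arithmetic.

```python
import math

def bernoulli_log_likelihood(y, p_hat):
    """Log-likelihood of binary outcomes y given fitted probabilities p_hat."""
    return sum(math.log(p) if y_t == 1 else math.log(1.0 - p)
               for y_t, p in zip(y, p_hat))

def aic(log_lik, k):
    """AIC = 2 * (k - log-likelihood), with k estimated parameters."""
    return 2.0 * (k - log_lik)

def minus_two_log_lik(log_lik):
    """The quantity -2 * log-likelihood used in the text as the DIC measure."""
    return -2.0 * log_lik

# Hypothetical outcomes and fitted probabilities for five matches.
y = [0, 0, 1, 0, 1]
p_hat = [0.2, 0.3, 0.6, 0.4, 0.7]
k = 2  # hypothetical number of estimated coefficients

log_lik = bernoulli_log_likelihood(y, p_hat)
print(aic(log_lik, k), minus_two_log_lik(log_lik))
```

Under these definitions the two measures differ only by the penalty term 2k, so they rank models with the same number of parameters identically.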
The results of estimating the frequentist and the non-informative symmetric Bayesian models are shown in Table 2. In the light of these results, the following variables were found to be significant among the game statistics and non-sports variables: the shots of the visiting team and the red cards shown to the home team, AS and HR, and the home and away budgets, BUDH and BUDA. The signs of the coefficients were positive except for BUDH, which means that the expectation of winning an away match decreases with the home team's budget. This is consistent with the idea that the higher the budget of the home team, the lower the probability of victory for the visitor. The high level of significance of the red cards shown to the home team for the victory of the visiting team should be noted. The results are similar for both models because the prior is non-informative in the Bayesian estimation. However, under the Bayesian approach, a new variable arises, INTERNATIONAL, which implies that, if the referee has international experience, the expectation of victory for the visiting team increases, i.e., non-international referees decrease the probability of winning for visiting teams.
The results of estimating the Bayesian asymmetric logit model are also shown in Table 2. We observe that the estimated coefficients differ considerably from those of the previous models, although the signs remain the same. This difference is further accentuated in the estimate of the constant. In the symmetric models, the estimated constant may contain part of the asymmetry effect made apparent in the asymmetric model. The new estimation, using the asymmetric Bayesian approach, improves the results, as confirmed by the values of the AIC and DIC. Table 2 also shows that the accuracy, i.e., the proportion of victories and non-victories (defeats or draws) that were correctly classified by the models, is 73.68% for the frequentist model (corresponding to 40 away victories and 240 away defeats or draws) and 71.58% for the symmetric Bayesian model (corresponding to 72 away victories and 200 away defeats or draws). The threshold probability used to classify a match as an away victory was the sample frequency of away victories, 0.302. As we can observe, the Bayesian symmetric model fits the away victories better but the away draws and defeats worse. Nevertheless, the best result is obtained with the asymmetric Bayesian logit estimation, which fits 100% of the away victories. These results are explained by the increase in the probability of fitting the $y_t = 0$ cases induced by the asymmetric model, since δ was negative. Figure 3 shows the receiver operating characteristic (ROC) curve for the frequentist, symmetric and asymmetric Bayesian models. The c-statistics are 0.725 for the frequentist model, 0.722 for the symmetric Bayesian model and 1 for the asymmetric Bayesian model. Table 3 shows the results obtained by the restricted models, i.e., the models including only the significant variables obtained in the previous estimations. These results underline the robustness of the estimations reported in Table 2. The signs, significance levels and percentages of correct fitting remain stable.
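The accuracy figures quoted above follow directly from the reported counts of correctly classified matches out of 380, and the c-statistic can be read as the proportion of (away victory, non-victory) pairs in which the victory receives the higher fitted probability. The sketch below reproduces the accuracy arithmetic and shows a pairwise computation of the c-statistic; the helper names and the small set of scores are hypothetical and serve only as an example.

```python
def accuracy(correct_victories, correct_non_victories, n_matches):
    """Share of matches whose outcome category was correctly classified."""
    return (correct_victories + correct_non_victories) / n_matches

def c_statistic(y, scores):
    """Area under the ROC curve computed by pairwise comparison (ties count 1/2)."""
    positives = [s for s, y_t in zip(scores, y) if y_t == 1]
    negatives = [s for s, y_t in zip(scores, y) if y_t == 0]
    concordant = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))

# Counts reported in the text for the 380 matches.
frequentist_accuracy = accuracy(40, 240, 380)
symmetric_bayes_accuracy = accuracy(72, 200, 380)
print(round(100 * frequentist_accuracy, 2), round(100 * symmetric_bayes_accuracy, 2))

# Hypothetical fitted probabilities for four matches.
y = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(c_statistic(y, scores))
```

A c-statistic of 1 means that every away victory received a higher fitted probability than every non-victory, which is the situation reported for the asymmetric Bayesian model.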

Conclusions
In this paper, we use a novel econometric methodology, asymmetric logistic regression, to extend the set of available quantitative tools. For binary response data, a skewed link function is suggested when one category is much more frequent than the other, as is usually the case in football datasets, where the away victory response is much less frequent than the home victory and draw responses.
Specifically, we present the asymmetric logistic regression to study the impact of the main factors on the probability of winning an away match. To our knowledge, this tool has not previously been applied in football studies. With this methodology, the model detects new relevant factors explaining the away victories of football teams that have not been detected by the standard methodologies. In this way, team staff would have a potential tool to replicate matches more efficiently, considering these important factors and estimating the probability of winning. The results lead to practical recommendations for coaches' decision-making such as, for instance, playing strategically as visitors, taking the initiative in attack, which favours shots on goal, or inducing opponents who play hard to be shown red cards. It seems clear that, if coaches want to improve their teams' performance, they should manage the available resources so as to maximize the winning probability of their teams, paying special attention to these key factors.
Taking all of these results into account, it is clear that the asymmetry has to be included in the logit model. As future research lines, panel data including random effects for a database of several seasons could be used, keeping the asymmetric link in mind. Future studies might also aim to predict the probability of away victories in the next period (season), considering the asymmetric information to improve the quality of this prediction.