Forecasting Credit Ratings of EU Banks

Abstract: The aim of this study is to forecast the credit ratings of E.U. banking institutions, as assigned by Credit Rating Agencies (CRAs). To do so, we developed alternative forecasting models that identify the non-disclosed criteria used in rating. We compiled a sample of 112 E.U. banking institutions, including their Fitch-assigned ratings for 2017 and the publicly available information from their corresponding financial statements spanning the period 2013 to 2016 that led to those ratings. Our assessment is based on identifying the financial variables that are relevant to forecasting the ratings and the rating methodology used. In the empirical section, we employed a rigorous variable selection scheme prior to training both Probit and Support Vector Machines (SVM) models, given that the latter originates from the area of machine learning and is gaining popularity among economists and CRAs. Our results show that the most accurate model, in terms of in-sample forecasting, is an SVM coupled with the nonlinear RBF kernel, which correctly identifies 91.07% of the banks' ratings using only 8 explanatory variables. Our findings suggest that a forecasting model based solely on publicly available financial information can adhere closely to the official ratings produced by Fitch. This provides evidence that the actual assessment procedures of the Credit Rating Agencies can be proxied fairly accurately by forecasting models based on freely available data, and that undisclosed information is of lower importance.


Introduction
Credit Rating Agencies (CRAs) have been around for more than 150 years. Their role progressed from simple information collectors to quasi-official evaluators of credit risk throughout the modern global financial system. CRAs were originally paid by potential investors to compile financial information and data at a time when such a service was too difficult and costly. After the 1929 stock market crash, however, CRAs started to play a more formal role in the financial system. The stricter rules imposed by regulators with the Glass-Steagall Act in the mid-1930s limited banking, insurance and other financial institutions to investing only in "investment grade" securities, as assessed by the CRAs. Since then, reliance on CRA ratings has grown, as they are increasingly incorporated in private contracts and in the investment guidelines of pension funds, endowment funds and other private entities.
In the aftermath of the 2008 global financial crisis, the role of CRAs has become increasingly important, albeit questionable: they provide important financial information to market participants, mainly by issuing ratings on the probability of default of specific debt issuers. In recent years, there has been increased interest in the credit rating process and specifically in the actual criteria used by the

Literature Review
While bank ratings are used extensively as explanatory variables in the economic literature, the nature of ratings per se remains largely under-examined. A paper closely related to this study is Gogas et al. (2014), who examined the ratings of the Fitch Rating Agency for 92 U.S. banks. The authors used ordered logit models to forecast bank credit ratings based on the publicly available financial statements of the banks. Their empirical findings suggested that almost 84% of the actual ratings can be matched based on publicly available information. Bissoondoyal-Bheenick and Treepongkaruna (2011) analyzed the quantitative determinants of bank ratings provided by Standard & Poor's, Moody's, and Fitch for U.K. and Australian banks. They based their analysis on an ordered probit model instead and found that accounting variables from the financial statements of banking institutions had more explanatory power in identifying banks' ratings than macroeconomic ones. Pagratis and Stringa (2007) conducted an ordered probit analysis in order to evaluate the potential linkage between Moody's bank ratings and bank characteristics such as provisions, profitability, cost efficiency, liquidity, short-term interest rates and bank size.
From a different perspective, Papadimitriou (2012) explored the clustering properties of 90 financial institutions using a correspondence analysis map. The goal was to match clustering groups with ratings from Fitch. The empirical findings support a correspondence between clusters and ratings, though the regions corresponding to the ratings overlap heavily. Credit ratings have also been explored with the use of machine learning methods. Ravi et al. (2008) argued that almost 83.5% of bank failures can be foreseen based on a Support Vector Machine (SVM) model that utilized the information of 54 financial variables on a sample of 1000 banks, over the period 2005-2008.
Although the issue of identifying the exact structure of credit ratings for banks has not been studied to a large extent by the relevant literature, significant relevant papers can be found in the area of predicting bond ratings. Ederington (1985), Pinches and Mingo (1973) and Belkaoui (1980) used statistical methods such as logistic regression and multivariable discriminant analysis (MDA) to predict bond ratings. Based on alternative sets of variables, the prediction results vary in accuracy between 50% and 70%. Many studies on bond credit rating prediction build neural network forecasting models (Dutta and Shekhar 1988; Surkan and Singleton 1990; Kim et al. 1993) that are more accurate than typical statistical methods. Moody and Utans (1995) used neural networks to forecast corporate bond ratings based on the ratings of S&P. Using 10 input variables they correctly forecasted 85.2% of the actual ratings. Maher and Sen (1997) compared neural networks to logistic regression models in forecasting bond ratings for the period 1990-1992. The most accurate model achieved 70% on a holdout sample. Kwon et al. (1997) compared ordinal pairwise partitioning (OPP) with back propagation and conventional neural networks for bond ratings of Korean firms. Using 126 financial variables for the period 1991-1993 they achieved 71-73% via neural networks with OPP and 66-67% via conventional neural networks. Huang et al. (2004) compared back propagation neural networks (BPNN) to SVMs in forecasting corporate credit ratings for the U.S. and Taiwan. The most accurate model was a linear SVM that classified 80% of the bonds correctly.
He et al. (2012) examined the relationship between ratings and the business cycle on mortgage-backed securities (MBS) spanning the period from 2000 to 2006 and their respective ratings from Moody's, S&P and Fitch. The idea was that large financial institutions will persuade CRAs to issue a higher rating than the one dictated by the rating methodology. This discrepancy should be visible when the price of securities sold by big issuers drops more than the price of securities sold by small issuers (keeping everything else fixed). The empirical findings provided evidence of more favorable ratings for larger issuers in comparison to small ones, especially during the market boom period of 2004-2006. Hau et al. (2013) extended the previous setting to the banking sector, using a cross-sectional sample of 39,000 banking institution ratings for the period 1990-2011 from Moody's, S&P, and Fitch. The authors concluded that large banks systematically received higher ratings than they should have received. An important factor in this favorable rating scheme is the provision of large securitization business to CRAs, which affects the final outcome of the rating. This phenomenon is more prevalent during economic booms, when the risk of reputational loss is lower. The erosion of the rating system due to this practice leads to the adverse phenomenon where the upper investment grade range does not reflect expected default probabilities, i.e., a higher rating does not necessarily correspond to a lower risk of default.
From a different perspective, Kraft (2015) compared ratings of issuers with rating-based performance-priced loan contracts to those of issuers with contracts based on accounting ratios and other loan agreements. The study examined adjustments to ratings, i.e., the difference between the actual rating and the hypothetical rating implied by reported financials. The study found that, after an adverse economic shock, the adjustments made for firms with rating-based contracts are more favorable than for firms with other types of contracts. This finding is consistent with the hypothesis of rating catering and suggests that reputational concerns are not sufficient to fully eliminate this phenomenon.

The Data
For our analysis we used a cross-section of 112 European banking institutions over the period 2013-2017. In order to train our forecasting models, we compiled observations for 34 variables from the Bank-Focus/Orbis 14 database that originate from the banks' financial statements up to 4 years prior to the 2017 actual rating grade. Thus, counting the lags of the 34 independent variables, we compiled a total of 136 explanatory variables (34 variables times 4 annual lags) considered as possible forecasters of bank ratings, where each lag was treated as an independent variable. The motivation for selecting up to 4 years of data prior to the 2017 Fitch rating stemmed from the fact that, as discussed in the introduction section, CRAs often react to the information reflected in financial statements with a delay. We obtained and used the ratings from Fitch for the year 2017 and the financial statements for the period 2013-2016, due to data availability issues. The independent variables can be classified into four general categories: Assets, Liabilities, Income statement and Financial Ratios. In Table 1 we report the compiled financial variables used as independent variables (or features in the machine learning terminology) in our models.
The dependent variable is ordinal and is grouped in our case into four classes. These are assigned integer values from 0 to 3, such that lower values indicate a lower rating. The groupings of the four classes are depicted in Table 2. The grouping is performed in such a way that the four identified classes contain a quasi-balanced number of banking institutions, forming a balanced dataset that avoids micronumerosity issues.
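The way 34 variables and four annual lags expand into 136 candidate features can be illustrated with a minimal sketch. The variable codes below are only the handful mentioned later in the text, and the `VARIABLE_YEAR` naming scheme is an assumption made for illustration; the full list of variables is the one in Table 1.

```python
# Illustrative only: a small subset of variable codes mentioned in the text.
# The NAME_YEAR naming convention is an assumption, not the authors' scheme.
variables = ["TASSET", "NIM", "TIR", "NOEAA", "NIRAA"]
years = [2013, 2014, 2015, 2016]  # up to four years before the 2017 rating


def lagged_feature_names(variables, years):
    """Return one feature name per (variable, year) pair."""
    return [f"{var}_{year}" for var in variables for year in years]


names = lagged_feature_names(variables, years)
print(len(names))        # 5 variables x 4 years
print(34 * len(years))   # the full study: 34 variables x 4 years
```

Treating each lag as its own column lets the selection procedure pick, for example, the 2013 value of a variable without also including its 2016 value.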

Support Vector Machines
Support Vector Machines (SVM) is a supervised machine learning method used in data classification. The basic concept of an SVM is to select a small set of data points from the initial dataset, called Support Vectors (SV), that define a linear boundary separating the data points into two classes. In what follows we briefly describe the mathematical derivation of the SVM.
We consider a dataset of vectors $x_i \in \mathbb{R}^2$ $(i = 1, 2, \ldots, n)$ belonging to two classes (targets, in the SVM jargon) $y_i \in \{-1, +1\}$. If the two classes are linearly separable, we define a boundary as:

$$f(x) = w^{T} x - b = 0 \tag{1}$$

where $w$ is the weight vector and $b$ is the bias. The optimal hyperplane is defined as the decision boundary that classifies each data vector to the correct class and has the maximum distance from each class. This distance is often called a "margin". In Figure 1, the SVs are represented with a contour circle, the margin lines (defining the distance of the hyperplane from each class) are represented by solid lines and the hyperplane is represented in the center.

In order to allow for a predefined level of error tolerance in the training procedure, Cortes and Vapnik (1995) introduced non-negative slack variables, $\xi_i \geq 0, \forall i$, and a parameter, $C$, describing the desired tolerance to classification errors. The solution to the problem of identifying the optimal hyperplane can be obtained through the Lagrange relaxation procedure of the following equation:

$$\min_{w, b, \xi} \; \max_{a, \mu} \left\{ \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i - \sum_{j=1}^{n} a_j \left[ y_j \left( w^{T} x_j - b \right) - 1 + \xi_j \right] - \sum_{k=1}^{n} \mu_k \xi_k \right\} \tag{2}$$

where $\xi_i$ measures the distance of vector $x_i$ from the hyperplane when classified erroneously, and $a_1, \ldots, a_n$ and $\mu_1, \ldots, \mu_n$ are the non-negative Lagrange multipliers. The hyperplane is then defined as:

$$\hat{w} = \sum_{i=1}^{n} a_i y_i x_i \tag{3}$$

$$\hat{b} = \hat{w}^{T} x_i - y_i, \quad i \in V \tag{4}$$

where $V = \{ i : 0 < a_i < C \}$ is the set of support vector indices.

When the two-class dataset cannot be separated by a linear separator, the SVM is paired with kernel methods. The concept is quite simple: the dataset is projected through a kernel function $K$ into a richer space of higher dimensionality (called a feature space), where the dataset is linearly separable. The solution to the dual problem with the projection of Equation (2) now transforms to:

$$\max_{a} \; \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} a_j a_k y_j y_k K(x_j, x_k) \tag{5}$$

subject to $\sum_{i=1}^{n} a_i y_i = 0$ and $0 \leq a_i \leq C, \forall i$.

The SVM model can be extended to a multiclass classification method using the one-against-the-rest approach: one class is kept aside and all others are grouped to form a new "grouped" class. After measuring the accuracy in forecasting the independent class kept aside, the second one is considered as independent and the others are grouped, and so on until all classes are rotated. The overall accuracy is measured as the mean accuracy over all independent classes.
In our models we examined two kernels: the linear kernel and the radial basis function (RBF). The linear kernel detects the separating hyperplane in the original dimensional space of the dataset, while the RBF projects the initial dataset onto a higher dimensional space. The mathematical representation of each kernel is:

Linear: $K(x_1, x_2) = x_1^{T} x_2$

RBF: $K(x_1, x_2) = e^{-\gamma \lVert x_1 - x_2 \rVert^{2}}$

where $\gamma > 0$ is the width parameter of the RBF kernel.
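As a rough illustration of how the two kernels enter the classifier, the sketch below evaluates the textbook forms of the linear kernel (an inner product) and the RBF kernel (a Gaussian of the squared Euclidean distance), and then a kernelized decision value of the form sum of a_i y_i K(x_i, x) minus b. The support vectors, multipliers, bias, `gamma` value and sign convention for the bias are all made-up assumptions for the example, not estimates from the paper.

```python
import math


def linear_kernel(x1, x2):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x1, x2))


def rbf_kernel(x1, x2, gamma=0.5):
    """exp(-gamma * squared Euclidean distance); gamma is a tuning parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, labels, alphas, b, kernel):
    """Kernelized decision value; its sign gives the predicted class."""
    total = sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    )
    return total - b


# Hypothetical toy model: one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, +1]
alphas = [1.0, 1.0]
bias = 0.0

value = decision_value((1.9, 1.9), support_vectors, labels, alphas, bias, rbf_kernel)
print("predicted class:", +1 if value > 0 else -1)
```

A point close to the positive support vector receives a positive decision value because the RBF kernel decays quickly with distance; swapping in `linear_kernel` reproduces the linear SVM.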

Feature Selection
We identified the variables that contribute the most to the assigned bank ratings following a thorough regression-based variable selection procedure. The selected variables were then fed to both a Probit and an SVM model. As a first step, we measured the correlation coefficient, $r_{i,R}$, between each independent variable $i$ and the assigned rating $R$. Based on the correlation values, we created six groups of regressors as follows. In group 1, we included all variables with $r_{i,R} \geq 0.4$. This resulted in 18 variables in group 1: TASSET, NIM, TIR and NOEAA for the period 2013-2016, and NIRAA for years 2014 and 2016. In a similar manner, in group 2 we included the variables with $r_{i,R} \geq 0.4$ along with all the lags of variable NIRAA. This group included 20 variables. In groups 3 and 4, we included the 30 variables with the highest positive correlation and the 30 variables with the highest negative correlation, respectively. In group 5, the variables included were the five with the highest correlation with the dependent variable and the five with the lowest one, a total of 10 variables. The last group, group 6, contained the entire sample of explanatory variables, a total of 136 features. Table 3 summarizes the variables' groups.
The next step was to use the selected groups in order to identify the most significant variables in terms of identifying bank ratings. This was done in each group by one of three methods: (a) a combinatorial exhaustive search over all possible sets of four variables in each one of the above six groups, picking the set with the highest R-squared; (b) the same process with all possible sets of eight variables from within each one of the six groups; and (c) a stepwise forward least squares technique where we kept the set of variables with a p-value lower than 0.1.
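The correlation-based prefiltering can be sketched as follows. This is a minimal illustration of the threshold rule used for group 1 and the "top and bottom k" rule used for group 5; the feature names and values are toy data invented for the example, and the helper names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def select_by_threshold(features, ratings, threshold=0.4):
    """Keep features whose correlation with the rating is at least threshold."""
    return [
        name for name, values in features.items()
        if pearson_r(values, ratings) >= threshold
    ]


def top_and_bottom(features, ratings, k):
    """Return the k most positively and k most negatively correlated features."""
    ranked = sorted(features, key=lambda name: pearson_r(features[name], ratings))
    return ranked[-k:][::-1], ranked[:k]


# Toy data: rating classes 0..3 for eight hypothetical banks.
ratings = [0, 1, 2, 3, 0, 1, 2, 3]
features = {
    "size": [1, 2, 3, 4, 1.5, 2.5, 3.5, 4.5],
    "noise": [3, 1, 4, 1, 5, 9, 2, 6],
    "risk": [4, 3, 2, 1, 4.2, 3.1, 2.2, 0.9],
}

print(select_by_threshold(features, ratings))
print(top_and_bottom(features, ratings, k=1))
```

With 136 candidate features, the same functions would be applied once per feature column; the exhaustive search over subsets is a separate step that operates on each prefiltered group.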

Ordered Probit Model Results
The above selection procedure resulted in 18 different sets of regressors. These sets were fed to an ordered probit model that forecasts the bank credit rating assigned by Fitch to each institution for the year 2017. The evaluation of the forecasting accuracy of each forecasting model is depicted in Table 5. Each column corresponds to one of the six groups of prefiltered regressors, while each row presents the forecasting results for the corresponding selection criterion. According to these results, the best accuracy using the probit model across all regressor selection criteria was achieved by the combinatorial search of eight variables from group 6.
Table 5, stepwise-forward row (groups 1 to 6): 57.14, 57.14, 50.89, 46.43, 51.79, 57.14. Note: The highest accuracy is reported in bold. All values are percentages.
The best accuracy was 66.07% and the variables used were:
In Table 6 we report the contingency table regarding the model's forecasts. The model achieved the best accuracy in class 2, which includes the A-, A and BBB+ ratings, with 70.59%. One might have expected this to be true for class 3, which includes the most creditworthy banking institutions. Thus, the model adheres closer to "mainstream" cases and identifies classes 0 and 3 less accurately. While accurately identifying class 3 can be of less importance, the accurate identification of soon-to-fail banks is of the utmost importance when it comes to financial risk management.
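For readers less familiar with the ordered probit, the sketch below shows how such a model turns a linear index into probabilities for the four rating classes: in the standard formulation, the probability of class k is the difference between the normal CDF evaluated at two consecutive cutpoints shifted by the index. The index value and the three cutpoints are hypothetical numbers chosen for illustration, not estimates from this study.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordered_probit_probs(index, cutpoints):
    """Class probabilities P(y = k) = Phi(c_k - index) - Phi(c_{k-1} - index).

    The outer cutpoints are minus and plus infinity, so K - 1 finite
    cutpoints define K ordered classes.
    """
    bounds = [-math.inf] + list(cutpoints) + [math.inf]
    cdf = [normal_cdf(bound - index) for bound in bounds]
    return [cdf[k + 1] - cdf[k] for k in range(len(bounds) - 1)]


# Hypothetical linear index (x'beta) for one bank and hypothetical cutpoints.
probs = ordered_probit_probs(index=0.3, cutpoints=[-1.0, 0.0, 1.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 4) for p in probs], predicted_class)
```

The forecast class is simply the one with the highest probability, which is how an accuracy figure can be computed against the Fitch classes.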

Support Vector Machines Model Results
We used the same 18 groups of variables to train our SVM models. In this methodological setting, we employed both the linear and the nonlinear RBF (Radial Basis Function) kernel. We followed two popular training schemes. In the first approach, we used a 3-fold cross validation scheme, while in the second one we followed a bootstrapping scheme of 8000 replications.
Cross validation is a common training scheme in machine learning applications. The basic idea is to split the dataset used in training the model into k folds of similar length and train iteratively, keeping at each step one fold aside for validation. For instance, in the 3-fold cross validation scheme we started by keeping the first fold aside and tuned the model's parameters based on the second and the third folds. The first fold was used to measure the forecasting accuracy of the trained model. We repeated the procedure by keeping the second and the third fold aside, respectively. The training accuracy of the model is the average over all three folds that were kept sequentially aside. The main advantage of cross validation training is that it avoids overfitting the data; thus, the model adheres more to the data generating mechanism that produces the phenomenon under investigation and less to the specific sample at hand.
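The fold rotation described above can be written down compactly. The sketch below is a generic k-fold loop, not the authors' code: the fold construction (contiguous blocks without shuffling), the majority-class "model" and the toy labels are all assumptions made to keep the example self-contained.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_accuracy(labels, k, fit, predict):
    """Average held-out accuracy over k folds kept aside one at a time."""
    folds = k_fold_indices(len(labels), k)
    fold_scores = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        model = fit(train)
        correct = sum(predict(model, i) == labels[i] for i in held_out)
        fold_scores.append(correct / len(held_out))
    return sum(fold_scores) / k


# Toy labels and a placeholder model that always predicts the majority class.
labels = [1, 1, 0, 1, 1, 0, 1, 1, 0]


def fit_majority(train_indices):
    return Counter(labels[i] for i in train_indices).most_common(1)[0][0]


def predict_majority(model, index):
    return model


accuracy = cross_validated_accuracy(labels, 3, fit_majority, predict_majority)
print(round(accuracy, 4))
print([len(fold) for fold in k_fold_indices(112, 3)])  # fold sizes for 112 banks
```

In practice the observations are usually shuffled (or stratified by class) before being split, so that each fold contains banks from every rating class.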
In bootstrap training we created a large number of surrogate random samples of the same length as the original sample, drawn from it with replacement. In this paper we created 8000 such samples and trained a corresponding model on each. Then, from the distribution of the resulting forecasts we computed the median and the 32nd and 68th percentiles in order to estimate the confidence intervals of the forecasts.
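The resampling and percentile steps can be sketched as below. This is a simplified stand-in, not the authors' procedure: instead of retraining an SVM on every surrogate sample, it resamples a fixed vector of per-bank hits (1 for a correct forecast, 0 otherwise) so that the example stays self-contained. The input vector, the seed and the nearest-rank percentile rule are assumptions.

```python
import math
import random


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in (0, 100]) of an already sorted list."""
    rank = max(1, math.ceil(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def bootstrap_accuracy(hits, n_replications=8000, seed=0):
    """Resample per-bank hits with replacement and summarize the accuracies."""
    rng = random.Random(seed)
    n = len(hits)
    accuracies = sorted(
        100.0 * sum(rng.choices(hits, k=n)) / n for _ in range(n_replications)
    )
    return {q: percentile(accuracies, q) for q in (32, 50, 68)}


# Illustrative input: 102 correct and 10 wrong forecasts out of 112 banks.
hits = [1] * 102 + [0] * 10
summary = bootstrap_accuracy(hits)
print(summary)  # keys are the 32nd, 50th (median) and 68th percentiles
```

Each surrogate sample has the same length as the original one, and the spread of the 8000 resulting accuracies is what the percentiles summarize.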
The forecasting accuracy of each model is depicted in Table 7 and Table 10 for the cross validation and the bootstrapping training, respectively. In Panels A and B we present the results for the linear and RBF kernels. The best forecasting model used eight variables. It is interesting that only one of them dated to the year immediately before the rating. This was Equity 2016, a measure of the capital invested in the bank or of bank size in terms of the stockholders' stake. The next most recent forecaster was Total interest received two years prior to the rating. This is a measure of income quality and profitability. It is very interesting that, of the other six best forecasting variables, four dated three years prior to the rating and two dated four years back. The four that dated three years back are Loans 2014 and Gross Loans 2014, both measures of core business exposure and size of operations; Deposits and short-term funding 2014, a measure of liquidity; and Other operating income/Average assets 2014, relating to income other than the core business of the bank. Finally, two variables dated a full four years prior to the target rating: Net interest revenues/Average assets 2013, a measure of operating efficiency, and again Other operating income/Average assets 2013, the non-core income of the bank.
According to these results, it seems that operating efficiency, as measured in the financial statements by net interest and other operating income over assets, has a long and lasting effect on the financial health of a banking institution, as reflected in its corresponding credit rating. Short-term funding (deposits etc.) and the size of the bank's core business (loans, total interest received) affect the ratings over a period of two to three years, and the only close rating determinant is total equity. Given that the actual data were classes ranging from zero to three [0, 3], the best forecasting SVM model classified only two banks as being on the brink of default (class 0) while they actually belonged to the highest class 3, and in five cases misclassified banks of class 0 as class 3 banks. The latter can be considered the more severe misclassification issue; in a future version of this manuscript we will focus on utilizing a different kernel that, unlike the RBF kernel, is not based on the normal distribution.
In Table 8, we report the contingency table of the most accurate SVM model's forecasts. The highest forecasting accuracy was achieved in class 0, which included banks on the brink of default, with 100% correct classification.
In Table 9 we depict the confusion matrix (actual and forecasted classes) of the most accurate SVM model. As we observe from Table 9, 26 of the 32 instances of class 1 were forecasted correctly, while four were classified as belonging to class 2, one to class 0 and one to class 3. For class 2, 31 were classified correctly while three were classified into class 3. For class 3, in only one instance was a bank classified as belonging to class 2 instead of the actual class 3. Together with the fully correct class 0 reported in Table 8, these ten misclassifications out of 112 banks match the overall accuracy of 91.07% (102/112). Thus, in most cases the model classified instances into neighboring classes and the tendency was to misclassify by assigning a higher (economically healthier) class. Alternatively, instead of using solely the accuracy ratio, we can estimate the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The higher the AUC, the more accurate the classification of the model. In Figure 2 we depict the AUC for the most accurate SVM and Probit model per class, respectively.
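A per-class AUC of this kind can be computed by treating one class as "positive" and the rest as "negative", then measuring how often a positive instance receives a higher score than a negative one. The sketch below uses this rank-based definition; the class labels and scores are toy values invented for the example, and the function names are hypothetical.

```python
def one_vs_rest_labels(classes, target):
    """Binary labels: 1 for the target class, 0 for every other class."""
    return [1 if c == target else 0 for c in classes]


def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as one half, which matches the area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: actual classes of six banks and a model's scores for class 3.
actual_classes = [0, 0, 1, 2, 3, 3]
scores_for_class_3 = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]

labels = one_vs_rest_labels(actual_classes, target=3)
print(auc(scores_for_class_3, labels))
```

An AUC of 1.0 means the class is perfectly separated from the rest, while 0.5 corresponds to random ranking, which is why values are reported separately for each of the four classes.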
The SVM model achieved a higher AUC for all four classes in comparison to the Probit model, reaching 0.95 for class 0, 0.79 for class 1, 0.77 for class 2 and 0.85 for class 3. The respective values of the Probit model are 0.86 for class 0, 0.64 for class 1, 0.66 for class 2 and 0.77 for class 3. Thus, both the accuracy and the AUC lead to similar conclusions.
Table 10 depicts the results of our bootstrapping training scheme. As we observe from Table 10, the most accurate model based on the bootstrapping training method was an SVM model coupled with the nonlinear RBF kernel that included the four variables selected from group 5, based on a stepwise-forward selection scheme. The forecasting accuracy was 98.21% with a 95% confidence interval of [97.32, 100], and the independent variables are (a) Total Assets 2013 and 2014, a measure of capital invested in the bank, (b) Total Interest received 2016, a measure of bank profitability, and (c) Other interest bearing liabilities 2013, a measure of market exposure. As with the most accurate cross validation training scheme, the most accurate bootstrapping model was based on financial variables that measure the market exposure or the bank's profitability. Interestingly, Fitch reacts to the information included in financial statements with a delay, as variables of 2013 and 2014 adhered closer to the unobserved, underlying rating mechanism than information included in 2016. Thus, while we would expect a rating agency to update its rating model yearly, we observe that this is not the case, and variables with a lag of 3 or 4 years are used.
Given that the ratings can be proxied very closely by the publicly available data, an improvement in the disclosure of the data by the banking institutions could reduce the dependence on rating agencies.
Naturally, forecasting credit ratings, however accurately, does not bypass the problem that credit ratings themselves can be inaccurate in representing the true creditworthiness of borrowers (Parnes and Akron 2016; Parnes 2018). Nevertheless, the "true" creditworthiness of an E.U. banking institution will always be unknown, since it depends on private information that cannot be unveiled through public information. A natural extension of our work would be to compare the ratings of Fitch to those of an alternative CRA in the framework of Parnes and Akron (2016), but we leave this avenue for future research.

Conclusions
In this study, we attempted to forecast the credit ratings of European financial institutions. To the best of our knowledge, this is the first time this has been done for European banks. In doing so we used a sample of 112 EU banks and tried to identify the most important factors contributing to their ratings. The target rating was the one provided by Fitch for the year 2017. In our approach, unlike what was done by CRAs, we only used publicly available data from the published financial statements of the banks. For each banking institution, we gathered data for 34 variables for 4 years prior to the 2017 rating, i.e., from 2013 to 2016. This resulted in 136 variables that were used as potential forecasters of Fitch ratings.
We followed a detailed variable selection procedure and created 18 alternative groups of regressors. First, we extracted six groups of variables based on different correlation criteria. Next, from each one of these six groups, we identified the most informative regressors using a combinatorial eight, combinatorial four and a stepwise forward procedure. This two-level variable selection scheme produced 18 alternative sets of explanatory variables. These regressors were then used in both a Probit model from classical econometrics and a Support Vector Machines (SVM) algorithm from the area of Machine Learning. In the case of the SVM model we used both the linear and the non-linear RBF kernel. Moreover, in this case and for the purpose of robustness of the results, we employed two training techniques: the standard cross-validation procedure with three folds to avoid overfitting and also bootstrapping with 8000 replications. The bootstrapping procedure enabled us to also produce confidence intervals for the forecasts.
Our empirical findings revealed that the SVM models vastly outperformed the Probit ones. The best Probit model reached an accuracy of 66.07%, while the best SVM with cross-validation reached 91.07%, and the best SVM with bootstrapping 98.21% with a 95% confidence interval of [97.32, 100]. The model based on the bootstrapping technique used only four independent variables as forecasters: (a) Total Assets of 2013 and 2014, implying that the size of a bank matters for the rating, (b) Total Interest received in 2016, which is a measure of bank income, and (c) Other interest bearing liabilities from 2013, which is a measure of market exposure and capital expenses. Thus, the main drivers of bank ratings are size, measured in total capital, and profitability, measured by interest income and interest expense. It is interesting to see that the credit rating of 2017 was mostly determined by the size of the bank three and four years before, its capital expenses four years before and the interest income in the previous year.
Thus, capital and interest expense have a longer-term effect on the rating, explaining the apparent (and criticized) sluggishness of the CRAs in changing an assigned rating, especially downwards. On the other hand, interest income has a more direct effect on the published rating. Moreover, we may infer from the apparent high accuracy of the best model that internal undisclosed information or other qualitative information used in the rating process by the CRAs plays a very small role in the rating model.
Author Contributions: Data curation, M.S.; Formal analysis, V.P. and E.D.; Methodology, V.P., P.G., T.P., E.D. and M.S.; Project administration, P.G.; Writing-original draft, V.P., P.G., T.P., E.D. and M.S.; Writing-review & editing, V.P., P.G. and T.P. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.