Modeling recovery rates for small- and medium-sized entities in the US

A sound statistical model for recovery rates is required for various applications in quantitative risk management. We compare different models for predicting the recovery rate at the borrower level, including linear and quantile regressions, decision trees, neural networks and mixture regression models. We fit and apply these models on the world's largest loss and recovery dataset for commercial loans, provided by Global Credit Data, where we focus on small- and medium-sized entities in the US. Additionally, we include macroeconomic information via a predictive Crisis Indicator. The horse race is won by the mixture regression model with regressed weight probabilities.


Introduction
Additional capital requirements and an increased awareness of the importance of credit risk modelling are one consequence of the financial crisis of 2007. Capital requirements, like the internal ratings-based approach of Basel II, allow financial institutions to estimate their credit risk with their own models. The main determinants of credit risk are the probability of default (PD), the exposure at default (EAD) and the loss given default (LGD); the latter is linked to the recovery rate (RR) via RR = 1 − LGD. We focus on the modeling of the recovery rate and compare different methods to estimate a firm-specific one.
According to paragraph 297 of the Basel Committee on Banking Supervision (2004), LGD has to be measured as loss given default as a percentage of the EAD. However, there exist several methods to calculate the LGD (resp. RR), namely the market LGD, the implied market LGD and the [...]
Address for correspondence: Amelie Schischke, Department of Mathematics, Technical University of Munich, Boltzmannstr. 3, 85747 Garching, Germany, E-Mail: amelie.schischke@tum.de
[...] Carvalho (2006) and Qi and Zhao (2011) mention as a drawback that, in reality, RRs are bounded and not normally distributed. Nevertheless, the linear regression model outperforms the Tobit model and the decision tree model for UK credit card accounts in the study of Bellotti and Crook (2009).
Many authors have adapted regression models to the situation of RRs. The inverse Gaussian (IG) regression transforms the RR by an inverse Gaussian distribution function from the interval (0, 1) to the real line. Qi and Zhao (2011) compare this to the inverse Gaussian regression with beta transformation, which is also used by Gupton and Stein (2005), Loterman et al. (2012) and Yao et al. (2015), where beta distributed LGDs are assumed and the inverse Gaussian distribution is applied subsequently. Linear regression aims at predicting the mean, whereas a quantile regression can analyze the influence of covariates on the entire distribution. Krüger and Rösch (2017) emphasize that quantile regression might hence be better suited for downturn scenarios.
In order to model the concentration of RRs at the boundaries {0, 1}, Bellotti and Crook (2009) propose a decision tree model, which is also used by Yao et al. (2015). A logistic regression model decides whether the RR takes the values 0 or 1. Subsequently, an ordinary least squares method is used inside (0, 1). Similarly, Loterman et al. (2012) use a logistic regression to determine whether the RR takes the boundary values and different parametric as well as non-parametric models to explain the RR inside (0, 1). However, the stand-alone application of the non-parametric models, especially neural networks and least squares support vector machines, outperforms the combinations.
Besides Loterman et al. (2012), several other researchers study non-parametric models. In Bastos (2010b), neural networks outperform the fractional response regression; in Qi and Zhao (2011), they outperform linear regression, inverse Gaussian regression, inverse Gaussian regression with beta transformation and the fractional response regression. However, Qi and Zhao (2011) mention as a drawback that neural networks are a black box, because there is no straightforward method to interpret the relationship between the independent and dependent variables.
Another type of model that has been used in different ways to predict the RR is the finite mixture model. Krüger and Rösch (2017) use a normal mixture distribution with two components for LGD and find that it performs best with their quantile regression on the GCD subset of US SMEs. In contrast, Ye and Bellotti (2019) propose a two-stage model that applies a beta mixture model to the RRs in (0, 1). This two-stage model outperforms the OLS, OLS with lasso as well as the beta regression. In addition, Tomarchio and Punzo (2019) present zero-and-one inflated mixture models. A three-level multinomial model first decides whether the LGD takes the value 0, takes the value 1 or lies in (0, 1). Subsequently, finite mixture distributions are applied to (0, 1), for which they test different component distributions.
In the study of Altman and Kalotay (2014), the RR transformed by the inverse normal distribution function is approximated as a mixture of Gaussian distributions, where only the probability of belonging to a certain component depends on covariates. Wang et al. (2018) [...] LGD is modeled as a mixture of the expansion and recession distributions, where each distribution is represented by the mixed model in Calabrese (2012). The mixture components represent the state of the credit cycle, i.e., whether times are good or bad.

Modeling Methods
This section provides the theoretical background of the techniques used in this study. We focus on decision trees, neural networks and mixture regression models. We refer to Fahrmeir et al. (2013), Hosmer and Lemeshow (2013) and Krüger and Rösch (2017) for more information on regression methods, in particular on quantile regression models and on model selection techniques.

Decision Tree
Since RRs are not normally distributed, a linear regression might not be adequate. As an alternative, the RR can first be transformed and a linear regression can then be applied to the transformed data. In the literature, e.g. in Gupton and Stein (2005), a beta transformation is used.
The transformed RR is:
$$\widetilde{RR} = \Phi^{-1}\left(F_{Beta}(RR, \alpha, \beta)\right), \qquad (1)$$
where $\Phi^{-1}$ is the quantile function of the standard normal distribution and $F_{Beta}(x, \alpha, \beta)$ is the distribution function of the beta distribution with shape parameters α and β, which have to be estimated. However, this transformation can only be applied to RR ∈ (0, 1). As our dataset also contains observations with an RR smaller than zero or greater than one, we use a decision tree approach as displayed in Figure 1. First, a logistic regression (or a neural network) determines the probability p that the RR is greater than or equal to 1. Then, a second classification model, i.e. a logistic regression (resp. neural network), estimates the probability q that the RR takes a value less than or equal to 0, given that it is smaller than 1. We use a linear regression to predict the rates $RR_{\geq 1}$ and $RR_{\leq 0}$. Inside (0, 1), we apply the beta transformation (1) to the RR. Subsequently, a linear regression (or a quantile regression) estimates the rate $RR_{(0,1)}$. If the linear regression were applied to the raw RR, predicted values could fall outside (0, 1). Therefore, we first apply the beta transformation and then use the linear regression on the transformed RR. In contrast to the linear regression, the estimates of the quantile regression do not leave the open unit interval. Therefore, we also apply this regression type to the raw RR and compare the results. According to our results, it is better to apply the quantile regression to the raw RR ∈ (0, 1). Hence, in the following, the corresponding results are presented.
The expected RR is expressed as a weighted average, where the weights are p, (1 − p) · q and (1 − p) · (1 − q). Hence, the expected RR is:
$$E[RR] = p \cdot RR_{\geq 1} + (1 - p) \cdot q \cdot RR_{\leq 0} + (1 - p) \cdot (1 - q) \cdot RR_{(0,1)}.$$
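The weighted average over the three branches of the tree can be illustrated with a few lines of Python. This is only a sketch: the function name `expected_rr` and all input numbers are illustrative assumptions and are not taken from the study, and the three branch predictions are assumed to come from models that have already been fitted.

```python
def expected_rr(p, q, rr_ge1, rr_le0, rr_mid):
    """Combine the three branch predictions into one expected recovery rate.

    p      : estimated probability that RR >= 1
    q      : estimated probability that RR <= 0, given RR < 1
    rr_ge1 : predicted rate in the branch RR >= 1
    rr_le0 : predicted rate in the branch RR <= 0
    rr_mid : predicted rate in the branch 0 < RR < 1
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("p and q must be probabilities")
    w_ge1 = p                        # weight of the branch RR >= 1
    w_le0 = (1.0 - p) * q            # weight of the branch RR <= 0
    w_mid = (1.0 - p) * (1.0 - q)    # weight of the branch 0 < RR < 1
    return w_ge1 * rr_ge1 + w_le0 * rr_le0 + w_mid * rr_mid


# Illustrative inputs only (hypothetical, not results of the paper).
example = expected_rr(p=0.4, q=0.25, rr_ge1=1.02, rr_le0=-0.05, rr_mid=0.55)
print(round(example, 4))
```

The three weights always sum to one, so the result is a proper weighted average of the branch predictions.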

Neural Networks
In this section, we present the structure of feedforward neural networks following Hastie et al. (2001) and Günther and Fritsch (2010).
In a neural network, whose structure is presented in Figure 2, neurons are arranged in layers. The neurons are connected by synapses, which are directed edges between them and the neurons of the subsequent layer. In order to keep the model simple, we consider feedforward neural networks with one hidden layer. The input layer contains all covariates, the so-called input variables $X_1, ..., X_p$, each of which is represented by a separate neuron. Each numerical attribute has its own neuron. In case of categorical variables, dummy coding as in a linear regression is applied. The output layer has K neurons $O_1, ..., O_K$. For regression problems with one response variable as well as for classification problems with two categories, we have K = 1.
The propagation function combines the output values $O_{j,\text{previous layer}}$ of the neurons j in the previous layer such that the result can be used as input $I_{i,\text{current layer}}$ for a neuron i in the current layer. We use the weighted sum:
$$I_{i,\text{current layer}} = \sum_{j \in \text{previous layer}} w_{i,j} \, O_{j,\text{previous layer}}.$$
The activation function σ transforms this value $I_{i,\text{current layer}}$ to the output value of the neuron, $O_{i,\text{current layer}} = \sigma(I_{i,\text{current layer}})$. For this, we use the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
The propagation function is applied again to obtain the input for the output layer. Then, for the output neuron $O_k$, k = 1, ..., K, in the output layer, we apply a final transformation by the output function $g_k$ instead of the activation function. In case of regression problems, we use the identity function as $g_k$, whereas we apply the softmax function in case of classification problems.
The weights have to be estimated in the training process. For this, we use the backpropagation algorithm in case of neural networks for classification problems, as backpropagation is the most widely used algorithm for supervised learning with multilayered feedforward networks according to Riedmiller and Braun (1993). In case of regression problems, we use the RPROP+ extension of Riedmiller and Braun (1993) and refer to their original paper for more information.
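The forward pass described above (weighted sum, sigmoid activation in the hidden layer, identity output for regression) can be sketched in plain Python. The function names, the weight layout and the constant bias neuron are assumptions made for this illustration; the weights are arbitrary numbers, not fitted values.

```python
import math


def sigmoid(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward pass through a network with one hidden layer and one output.

    x        : list of input values (already dummy coded and scaled)
    w_hidden : one weight list per hidden neuron; the first entry is the
               weight of a constant bias neuron (an assumption of this sketch)
    w_out    : weights of the output neuron, bias weight first
    """
    inputs = [1.0] + list(x)
    # Propagation function (weighted sum) followed by the sigmoid activation.
    hidden = [sigmoid(sum(w * o for w, o in zip(row, inputs))) for row in w_hidden]
    # Propagation function again; the output function is the identity (regression).
    return sum(w * o for w, o in zip(w_out, [1.0] + hidden))


# Two inputs, two hidden neurons; all hidden weights are zero, so every
# hidden neuron outputs sigmoid(0) = 0.5.
prediction = forward([0.3, 0.7], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.1, 1.0, 1.0])
print(prediction)
```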

Mixture Models
In a linear regression model, we assume that the dependent variable relates to the covariates by a fixed parameter β over all observations. This assumption is often too restrictive, which calls for models in which the regression coefficient can change over different clusters among the observations. One such family of models are finite mixture models, which will be presented following Frühwirth-Schnatter (2006), Grün et al. (2008), Leisch (2004) and Murphy (2012).
In general, a finite mixture regression model with K components has the form:
$$h(y \mid x, \psi) = \sum_{k=1}^{K} \pi_k \, f(y \mid x, \theta_k), \qquad (2)$$
where $\pi_k$, k = 1, ..., K, are the weights with $\pi_k \geq 0$ and $\sum_{k=1}^{K} \pi_k = 1$, and $\psi = (\pi_1, ..., \pi_K, \theta_1, ..., \theta_K)$ is the vector of all unknown parameters. $\theta_k$ denotes the component-specific parameter vector for the density function f. If f is a univariate normal density with component-specific mean $\beta_k^\top x$ and variance $\sigma_k^2$, we get a mixture of standard linear regression models with $\theta_k = (\beta_k^\top, \sigma_k^2)^\top$.
The weights $\pi_k$, k = 1, ..., K, in Equation (2) are usually independent of the covariates. One extension is the concomitant variable model by Grün et al. (2008), which assumes that the weights depend on some variables, the so-called concomitant variables denoted by c. Then, the mixture model can be written as:
$$h(y \mid x, c, \psi) = \sum_{k=1}^{K} \pi_k(c, \alpha) \, f(y \mid x, \theta_k), \qquad (3)$$
where α denotes the parameter vector of the concomitant variables and ψ contains all parameters including α. The remaining arguments are defined as in Equation (2) and the weights have to satisfy the conditions $\pi_k(c, \alpha) > 0$ and $\sum_{k=1}^{K} \pi_k(c, \alpha) = 1$. Following Grün et al. (2008), we assume a multinomial logit model for the weights $\pi_k$, which can be written as:
$$\pi_k(c, \alpha) = \frac{\exp(c^\top \alpha_k)}{\sum_{u=1}^{K} \exp(c^\top \alpha_u)} \qquad (4)$$
for all k = 1, ..., K and with $\alpha = (\alpha_k)_{k=1,...,K}$ and $\alpha_1 \equiv 0$.
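To make the role of the concomitant variables concrete, the following Python sketch evaluates multinomial logit weights with the first parameter vector fixed to zero and plugs them into a mixture of normal linear regressions. All function names and parameter values are hypothetical; this is not the flexmix implementation.

```python
import math


def concomitant_weights(c, alphas):
    """Multinomial logit weights pi_k(c, alpha); alphas[0] should be all zeros."""
    scores = [sum(a * v for a, v in zip(alpha_k, c)) for alpha_k in alphas]
    shift = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normal_pdf(y, mean, sigma):
    """Density of the univariate normal distribution."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(y, x, c, betas, sigmas, alphas):
    """Mixture density with component means beta_k' x and weights pi_k(c, alpha)."""
    weights = concomitant_weights(c, alphas)
    means = [sum(b * v for b, v in zip(beta_k, x)) for beta_k in betas]
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sigmas))


# Hypothetical three-component example with an intercept and one covariate.
alphas = [[0.0, 0.0], [0.5, -1.0], [-0.2, 0.8]]
weights = concomitant_weights([1.0, 0.3], alphas)
density = mixture_density(0.9, [1.0, 0.3], [1.0, 0.3],
                          betas=[[1.0, 0.0], [0.5, 0.1], [0.1, 0.2]],
                          sigmas=[0.05, 0.1, 0.3], alphas=alphas)
print(weights, density)
```

Fixing the first parameter vector to zero makes the first component the reference category, so only K − 1 parameter vectors have to be estimated.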
For parameter estimation, we write the log-likelihood function of a sample of n observations $(x_1, y_1), ..., (x_n, y_n)$ as:
$$\log L(\psi) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \, f(y_i \mid x_i, \theta_k) \right).$$
Since the membership to the components is unknown, this likelihood function cannot be maximized directly. For maximum likelihood estimation, we use the iterative Expectation-Maximization (EM) algorithm introduced in Dempster et al. (1977). It is implemented for the concomitant mixture models in the R-package flexmix.
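The idea of the EM algorithm can be sketched on a deliberately simplified case: a mixture of normal components with constant means (intercept-only regressions) and fixed weights that do not depend on concomitant variables. The code below is an illustration under these assumptions, with made-up data and a small floor on the standard deviations added for numerical safety; it is not the algorithm as implemented in flexmix.

```python
import math


def normal_pdf(y, mean, sigma):
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_step(ys, weights, means, sigmas):
    """One EM iteration; also returns the log-likelihood of the input parameters."""
    n_comp = len(weights)
    resp, loglik = [], 0.0
    # E-step: posterior probability that each observation belongs to each component.
    for y in ys:
        joint = [weights[k] * normal_pdf(y, means[k], sigmas[k]) for k in range(n_comp)]
        total = sum(joint)
        loglik += math.log(total)
        resp.append([j / total for j in joint])
    # M-step: weighted updates of the weights, means and standard deviations.
    new_w, new_m, new_s = [], [], []
    for k in range(n_comp):
        n_k = sum(r[k] for r in resp)
        mean_k = sum(r[k] * y for r, y in zip(resp, ys)) / n_k
        var_k = sum(r[k] * (y - mean_k) ** 2 for r, y in zip(resp, ys)) / n_k
        new_w.append(n_k / len(ys))
        new_m.append(mean_k)
        new_s.append(max(math.sqrt(var_k), 1e-3))  # floor is an assumption of this sketch
    return new_w, new_m, new_s, loglik


# Made-up recovery rates with modes near 0, 0.5 and 1.
ys = [0.0, 0.05, -0.03, 0.02, 0.5, 0.45, 0.55, 0.52, 1.0, 0.98, 1.03, 0.97]
w, m, s = [1 / 3, 1 / 3, 1 / 3], [0.1, 0.4, 0.9], [0.2, 0.2, 0.2]
logliks = []
for _ in range(50):
    w, m, s, ll = em_step(ys, w, m, s)
    logliks.append(ll)
print(m)
```

In this example, the recorded log-likelihood does not decrease from one iteration to the next, which is the defining property of EM.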

The Global Credit Data (GCD) database
Like Krüger and Rösch (2017), we use a dataset of US-based small- and medium-sized entities (SME) from Global Credit Data (GCD) for our empirical analysis. GCD is a Dutch-based, not-for-profit registered association whose owners are more than 50 member banks across the world.
The objective of GCD is to be a credit risk data pooling initiative that supports the member banks with their internal credit risk models, inter alia for the advanced internal ratings-based approach of Basel II. We use the LGD&EAD platform, which is the world's largest loss and recovery dataset for commercial loans and contains data relating to credit defaults from 1998 until the end of 2016. This time period encompasses more than one full economic cycle, as required by paragraph 472 of the Basel Committee on Banking Supervision (2004). Table 9 in the Appendix gives an overview of all variables used.
We adjust the data following Höcht and Zagst (2008). First, the exposure at default has to be strictly greater than zero, as the focus of this study lies on real losses. Second, we only consider loans where EAD + Principal Advance + Financial Claim ≥ €5,000, such that very small exposures are excluded. Third, the default date has to lie in the interval [January 2002, December 2015]. We exclude cases before the year 2002 due to modified banking regulations. As the cases after 2015 might still be unresolved, we exclude them as well. Fourth, to exclude all facilities that are not fully resolved or exhibit unreasonable cash flows, the following rule is applied according to Höcht and Zagst (2008): If the total sum of all reported cash flows (including charge-offs and waivers) divided by the outstanding amount at default is smaller than 90% or greater than 105%, the facility is not considered. Fifth, only cases with resolved default status are of interest.
Finally, the RR has to lie in the interval [−0.5, 1.5]. All observations with a smaller or greater RR are excluded to avoid outliers.
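The six filter rules above can be collected in a single predicate. The sketch below assumes that each facility is available as a dictionary; all field names are hypothetical and do not correspond to the actual column names of the GCD platform.

```python
from datetime import date


def keep_facility(f):
    """Return True if a facility record passes all six filter rules."""
    if f["ead"] <= 0:                                    # 1. EAD strictly positive
        return False
    if f["ead"] + f["principal_advance"] + f["financial_claim"] < 5000:
        return False                                     # 2. minimum exposure of 5,000
    if not date(2002, 1, 1) <= f["default_date"] <= date(2015, 12, 31):
        return False                                     # 3. default date window
    if f["outstanding_at_default"] <= 0:                 # guard added in this sketch
        return False
    ratio = f["total_cash_flows"] / f["outstanding_at_default"]
    if ratio < 0.90 or ratio > 1.05:                     # 4. plausible cash flows
        return False
    if f["default_status"] != "resolved":                # 5. resolved cases only
        return False
    return -0.5 <= f["rr"] <= 1.5                        # 6. outlier bounds on the RR


# Hypothetical record that passes all rules.
record = {
    "ead": 120000.0, "principal_advance": 0.0, "financial_claim": 0.0,
    "default_date": date(2009, 6, 15), "outstanding_at_default": 120000.0,
    "total_cash_flows": 118000.0, "default_status": "resolved", "rr": 0.83,
}
print(keep_facility(record))
```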
Furthermore, we split the data into three groups: training, validation and test set. Following Murphy (2012), the training set, in regression problems the so-called in-sample set, contains 80% of the data and is used to estimate the models. In order to assess how well a model predicts new observations, the trained models are applied to the test set, which is also called the out-of-sample set. This data is not used in the estimation of the model and therefore, these results are reliable and can be compared. Some models need hyperparameters, for example the number of hidden neurons in a neural network. Since the training data is already used and the test data should remain independent of the modeling process, we use a third dataset, the validation set, to tune the hyperparameters. The test set and the validation set each contain 10% of the data.
The histogram of the RR is presented in Figure 3 and shows a high concentration at full recovery. Furthermore, there are two additional peaks near 0 and 0.5. In the literature, the RR has frequently been modeled using a bimodal structure, for example in the studies of Altman and Kalotay (2014), Bastos (2010a), Bastos (2010b) and Qi and Zhao (2011). Similar to our data, Ye and Bellotti (2019) use a trimodal distribution.
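An 80/10/10 split into training, validation and test set can be sketched as follows. The random shuffling and the fixed seed are assumptions of this illustration; the text does not state how the observations are assigned to the three sets.

```python
import random


def split_indices(n, train_share=0.8, validation_share=0.1, seed=0):
    """Shuffle the indices 0..n-1 and cut them into training, validation and test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = int(round(train_share * n))
    n_val = int(round(validation_share * n))
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]          # remaining observations
    return train, validation, test


train, validation, test = split_indices(1000)
print(len(train), len(validation), len(test))
```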

Predictive Crisis Indicator
Some studies, for example Calabrese (2012) and Höcht and Zagst (2008), find that the recovery rate tends to be lower during economic downturns. Brumma and Winckle (2017) observe that the macroeconomic behavior during the resolution time has an influence on the recovery. Therefore, we use a predictive crisis indicator, which indicates whether a crisis might occur in the next 18 months (the average resolution time).
To model the predictive Crisis Indicator, we first calculate a daily Crisis Indicator using a modified version of the algorithm of Ernst et al. (2009), where we use two-year highs instead of half-year highs. The algorithm of Ernst et al. (2009) [...]
In this paper, the Crisis Indicator is used in different ways:
(C1) We do not include the crisis information at all.
(C2) The predicted Crisis Probability calculated from the logistic regression model is included as a covariate.
(C3) The Crisis Indicator is included as a covariate.
(C4) We split the data into crisis and non-crisis dataset and train the models on each subset.

Empirical Results
The focus of this study lies on mixture regression models. Thus, we only briefly present the best results of the regression models, decision trees and neural networks and subsequently concentrate on the mixture models. We also give an overall comparison of all models.
In order to decide on the best model, we use the mean squared error (MSE) as measure of fit, defined as:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
where $y_i$, i = 1, ..., n, are the observed RRs and $\hat{y}_i$ are the estimated RRs. This measure is also used, e.g., in Calabrese (2012), Gupton and Stein (2005) and Ye and Bellotti (2019).
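As a minimal illustration of this measure of fit, the MSE can be computed as below. The function name and the example values are hypothetical.

```python
def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted recovery rates."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / len(observed)


# Hypothetical observed and predicted recovery rates.
mse = mean_squared_error([1.0, 0.5, 0.0], [0.8, 0.5, 0.1])
print(mse)
```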

Regression Models
First, we consider the results of the regression models. For model selection in the different regression problems, we apply stepwise selection based on the BIC, as it penalizes model complexity more strongly than the AIC. We use this model selection criterion in every regression problem in the following.
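The stronger complexity penalty of the BIC follows from the standard definitions AIC = −2 log L + 2k and BIC = −2 log L + k log n, where k is the number of parameters and n the number of observations. The sketch below uses these textbook formulas with made-up numbers.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: penalty of 2 per parameter."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: penalty of log(n) per parameter."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical model with 5 parameters fitted on 1,000 observations.
penalty_gap = bic(-100.0, 5, 1000) - aic(-100.0, 5)
print(penalty_gap)
```

The BIC penalty per parameter exceeds that of the AIC as soon as log n > 2, i.e. for n ≥ 8 observations.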
We concentrate on the linear regression model including the Crisis Indicator as well as on the models trained on the crisis and non-crisis subsets, since these models performed best (see Table 6). Regarding the included covariates within the linear models, which are presented in Table 2, we recognize that in case of crisis, the RR is only determined by the information whether a guarantee or collateral is given and by the size of the EAD. Moreover, in the crisis case, the [...]
In case of the quantile regression model, we regress the median in order to compare the results of the quantile regression to the results of the linear regression. The quantile regression including the Crisis Indicator outperforms the remaining quantile regression models considering MSE (see Table 6). Therefore, we have a look at the included variables of this model and recognize that the model selection results in more attributes for the quantile regression than for the linear regression models. In particular, the variables Country of Business, Leveraged Finance Indicator, Operating Company Indicator, Collateral Rank of Security as well as Guarantor Rating Moodys are only included in the quantile regression model, whereas the remaining covariates are part of both models.

Decision Tree
We use a decision tree approach in order to apply the beta transformation on the RR. Similar to the regression models in Section 5.1, we use stepwise selection with BIC for model selection.
Besides the logistic regression, a neural network can be applied for the classification problems. For this, we use the function nnet of the R-package nnet. As input, all available attributes are used. The network is trained by the backpropagation algorithm and has only one hidden layer for simplicity. Furthermore, the optimal number of hidden neurons is determined: the network is trained on the training data with different numbers of hidden neurons from 1 to 10. The prediction error on the validation set is the selection criterion and results in one hidden neuron for both classification problems, i.e., whether the RR is greater than or equal to 1 and whether the RR is smaller than or equal to 0.†
We first remark that we report the trees with the quantile regression for the median applied on the raw RR in the open unit interval, because they outperform those on the beta transformed RR. One reason for this might be that the RR in our dataset is trimodal and the beta distribution might not fit well.
Table 3 gives an overview of which covariates are included in the different regression problems.
We focus in the unit interval on the linear regression model without crisis information and the quantile regression including the Crisis Probability, as the decision trees including these models outperform the other decision trees. Similar to the regression problems above, the quantile regression model including the Crisis Probability contains more variables than the linear regression model. Furthermore, the Crisis Indicator affects only the logistic regression that decides whether the RR attains a value smaller than or equal to zero.
† In general, a neural network with one hidden neuron and the sigmoid function as activation function equals a logistic regression model. Since the estimation method is different (backpropagation algorithm in case of a neural network and maximum likelihood estimation in case of a logistic regression), the parameters of the two models can be different.
Table 3: Included attributes in the regression models of the decision trees.

Neural Network
Another possibility to model the RR are neural networks, whose results are shown in the following.
We begin with the description of the predetermined model parameters and present the results in Table 6.
In order to train neural networks for regression problems, we use the function neuralnet of the R-package neuralnet, which applies the RPROP+ algorithm. For this, we set the multiplication factors for the lower and upper learning rate to η− = 0.5 and η+ = 1.2, the parameter threshold to 0.01 and the maximum number of iterations to 1e7. We use the sigmoid activation function, the identity as output function and the sum of squared errors as error function.
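The role of the two multiplication factors can be illustrated with a small sketch of a resilient backpropagation update with weight backtracking, applied here to a toy quadratic error function rather than to a neural network. The initial step size, the step bounds and the function names are assumptions of this sketch; it is not the implementation used by the neuralnet package.

```python
def sign(v):
    return (v > 0) - (v < 0)


def rprop_plus(grad_fn, weights, eta_minus=0.5, eta_plus=1.2, step_init=0.1,
               step_min=1e-6, step_max=50.0, threshold=0.01, max_iter=10000):
    """Minimize a function via sign-based updates with weight backtracking."""
    w = list(weights)
    steps = [step_init] * len(w)
    prev_grad = [0.0] * len(w)
    prev_delta = [0.0] * len(w)
    for _ in range(max_iter):
        grad = grad_fn(w)
        if max(abs(g) for g in grad) < threshold:   # stop once all partial derivatives are small
            break
        for i, g in enumerate(grad):
            change = prev_grad[i] * g
            if change > 0:       # same sign as before: enlarge the step
                steps[i] = min(steps[i] * eta_plus, step_max)
                prev_delta[i] = -sign(g) * steps[i]
                w[i] += prev_delta[i]
                prev_grad[i] = g
            elif change < 0:     # sign flipped: shrink the step and undo the last update
                steps[i] = max(steps[i] * eta_minus, step_min)
                w[i] -= prev_delta[i]
                prev_grad[i] = 0.0
            else:                # first iteration or right after a backtracking step
                prev_delta[i] = -sign(g) * steps[i]
                w[i] += prev_delta[i]
                prev_grad[i] = g
    return w


# Toy error function sum_i (w_i - t_i)^2 with hypothetical targets t.
target = [1.0, -2.0]
solution = rprop_plus(lambda w: [2.0 * (wi - ti) for wi, ti in zip(w, target)], [0.0, 0.0])
print(solution)
```

Only the sign of each partial derivative enters the update, so the step sizes adapt separately for every weight.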
For reasons of simplicity, all neural networks contain one hidden layer and the optimal number of hidden neurons is determined by minimizing the MSE on the validation set.
We use all available attributes as input variables and do not apply any model selection in advance, because the neural network should identify the important variables on its own.
For categorical covariates, dummy variables are created just like for the regression problems.
Following Lantz (2015), we scale the metric data to the unit interval [0, 1], as it is not normally distributed.
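Scaling a metric attribute to the unit interval is commonly done by min-max normalization, sketched below. The function name and the example values are hypothetical, and mapping a constant attribute to zero is an assumption of this sketch.

```python
def min_max_scale(values):
    """Map values linearly to [0, 1]; a constant attribute is mapped to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical exposures at default.
scaled = min_max_scale([10.0, 20.0, 30.0])
print(scaled)
```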

Mixture Regression Models
In this subsection, we present the results of the mixture regression models from Section 3.3. First, we apply the model selection to identify relevant covariates. Subsequently, the results of the mixture models with and without concomitant variables are shown.

Model Parameters
Our motivation to investigate mixture regression models stems from the observation of multiple modes. Since the RR is trimodal in our dataset, our mixture regression models have three components. We use the package flexmix in R to fit our models. Unfortunately, no model selection is implemented there. In addition, the EM algorithm used to fit the model does not converge for every possible combination of input variables. For a pre-selection and in order to reduce the overall number of covariates, we use the input variables of Krüger and Rösch (2017), who base their study on loan data of SMEs in the US provided by GCD. We use our crisis information instead of macroeconomic data and focus our analysis on entity level, hence we cannot use all attributes of Krüger and Rösch (2017). In conclusion, the resulting variables are log-transformed EAD, Guarantee Indicator, Collateral Indicator, Primary Industry Code as well as the Crisis Indicator resp. Crisis Probability. Following this pre-selection, all possible combinations of variables are formed and it is tested whether the EM algorithm converges. We compare the mixture regression models by the BIC, and the best model contains the covariates Collateral Indicator, log-transformed EAD and Crisis Probability. For simplicity, this combination of variables is also used as concomitant variables in the following. We notice that it is better to use the Crisis Probability than the Crisis Indicator in this method.
The package flexmix provides information about the standard error as well as the z- and p-value for every coefficient in every component. In case of a negative entry in the diagonal of the variance-covariance matrix, the standard error cannot be computed. This is partially the case for the coefficient of the log-transformed attribute EAD.
Therefore, we exclude this attribute for the regression problems, but it is still part of the multinomial model. In order to distinguish the different models, we refer to the model including the EAD as covariate by the name "M1" and denote the model without the EAD as "M2." If, in addition to the components, the probabilities of belonging to the components are regressed, we denote the models "M1C" and "M2C," since the attributes included in the multinomial models are called concomitant variables.

Model Description
[...] are vanishing. Furthermore, the intercept at 0.5 as well as the attribute Collateral Indicator have the most impact on the second cluster. In contrast to the other clusters, the characteristic Yes of the Collateral Indicator has a negative impact on the third cluster. Moreover, the influence of the characteristic No of the Collateral Indicator as well as of the Crisis Probability is negative. In addition, the EAD has a slightly positive impact. The third component has the highest fluctuations (reflected in a Sigma of 0.308).
Having a look at M2, we notice that the first component is mainly influenced by the intercept near one and that the attributes have little impact. In comparison to the control group Unknown, the categories Yes and No of the Collateral Indicator have a positive impact on the second component. As the parameter of Yes of the Collateral Indicator is higher than the coefficient for No, we would expect higher values for entities having a collateral. It is counterintuitive that this component increases its value if the Crisis Probability increases. The highest values of the third component are expected in case of a collateral, whereas the lowest values will be attained when there is no collateral. In addition, the value of this cluster will be higher if the Crisis Probability is small. The attributes have the highest impact on the third cluster due to their higher absolute values. Sigma with a value of 0.270 underlines this finding, as it is higher than the sigma of the first and second cluster.
Aleksey Min et al.
The first as well as the second component of the mixture regression model M1C are mainly determined by the intercept near one, as the coefficients of the covariates are near zero and have little influence. Sigma of the third component is the highest, indicating higher fluctuations. The characteristics of the Collateral Indicator influence the third component negatively due to their negative parameters. Moreover, the log-transformed EAD has a positive coefficient, which indicates that the value of the third cluster is positively related to the EAD. In addition, a higher Crisis Probability results in a higher value for this component, which is counterintuitive.
In model M1C, we regress the probabilities that an observation belongs to a certain component. In this study, a multinomial logit model is assumed for the weights $\pi_k$, k = 1, ..., K, as depicted in Equation (4). One assumption of this model is $\alpha_1 \equiv 0$. Therefore, only the parameters of the second and third component are given in Table 5.
The probability of belonging to the second component increases if no collateral is available and decreases if a collateral is given. Moreover, a borrower with a lower EAD (resp. a lower Crisis Probability) is expected to have a higher probability of belonging to the second cluster. Furthermore, the probability that an entity belongs to the third cluster is expected to be lower in case of a collateral and higher in case of no collateral. The log-transformed EAD has again a negative influence, whereas the Crisis Probability has a positive one.
In model M2C, we would expect that the value of the first component is higher in case of no collateral than in case of a collateral. In addition, a higher Crisis Probability will lead to higher values of the first component. This cluster has the highest variation, which is reflected in the high value of Sigma of 0.182. The covariates have little impact on the second component, which is mainly determined by its intercept near one. Moreover, the value of the third cluster is lower for an observation with a collateral than for one without any collateral. Furthermore, the third component is expected to attain a higher value for a higher Crisis Probability.
The probability that an observation belongs to the second cluster increases if it has no collateral, whereas having a collateral decreases this probability. In addition, the higher the log-transformed EAD or the higher the Crisis Probability, the lower the probability that an observation belongs to the second cluster. Furthermore, the probability that an observation belongs to the third cluster reaches a maximum if there is no collateral. Moreover, we would expect a lower probability that an entity belongs to the third cluster if the EAD is high or the Crisis Probability is low.

Comparison of all models based on MSE
Table 6 shows the in-sample as well as the out-of-sample results for all models, including the linear regression model with the different assumptions (C1), ..., (C4) on the crisis information. For the linear regression, the MSE favors the model including the Crisis Indicator in-sample and the models on separate crisis and non-crisis subsets out-of-sample. Since the models on the split data give more insights into the determinants of the RR in the crisis and non-crisis case, this approach might be preferred, as the goodness-of-fit of the models is similar.
Having a look at the quantile regression, the model including the Crisis Indicator outperforms the remaining quantile regression models considering MSE. Moreover, the linear regression models outperform the quantile regressions. This fact might be explained by the different optimization problems. The estimation method of the linear regression minimizes the sum of squared errors, whereas the quantile regression for the median minimizes the mean absolute error. However, the quantile regression models give more insights into the structure of the distribution, since different quantiles can be modeled. We refer to Krüger and Rösch (2017), who calculate further quantile regressions for several quantiles.
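The effect of the two optimization problems can be seen in the simplest possible setting, a model that predicts one constant for all observations: the mean minimizes the squared error, while the median minimizes the absolute error. The sketch below uses made-up recovery rates with a concentration at full recovery.

```python
import statistics


def squared_loss(values, c):
    """Mean squared error of the constant prediction c."""
    return sum((v - c) ** 2 for v in values) / len(values)


def absolute_loss(values, c):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(v - c) for v in values) / len(values)


# Made-up recovery rates, skewed towards full recovery.
rr = [0.0, 0.1, 0.5, 1.0, 1.0, 1.0, 1.0]
mean_rr = statistics.mean(rr)
median_rr = statistics.median(rr)

print(squared_loss(rr, mean_rr), squared_loss(rr, median_rr))
print(absolute_loss(rr, mean_rr), absolute_loss(rr, median_rr))
```

A model fitted to the median therefore cannot achieve a lower in-sample MSE than the corresponding model fitted to the mean, in line with the ranking reported above.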
Comparing the decision trees by MSE, the models with the linear regression in (0, 1) outperform those with the quantile regression. Moreover, it is preferable not to include any information about a crisis for the decision tree with the linear regression in the unit interval, whereas the decision tree with quantile regression including the Crisis Probability outperforms the other decision trees with quantile regression. Additionally, the trees with the neural networks for classification result in lower MSEs than the models with the logistic regressions. This finding might be explained by the slightly lower prediction error of the neural network for the classification whether the RR attains values smaller than or equal to 0. However, the differences in the MSE are marginal and, since the logistic regression gives more insight into the determinants of the RR, it might be preferable to use it. We compare the results of the decision tree approach with the regressions on the entire dataset and recognize that the models on the entire data result in lower MSEs than the decision trees.
Regarding the results of the neural networks, we notice that the network including the Crisis Probability outperforms the other neural networks in-sample as well as out-of-sample. Furthermore, we recognize that the MSE is small in-sample, whereas the out-of-sample results are similar to the MSE of the logistic regression models. One reason for the difference in the MSE between the in-sample and out-of-sample subset might be overfitting.
Finally, we consider the results of the mixture regression models. The models excluding EAD as covariate are superior to the mixture regression models including the covariate EAD.
In addition, the models which regress the densities as well as the probabilities outperform the mixture models with fixed probabilities. In conclusion, model M2C is the best model.
In-sample as well as out-of-sample, the mixture regression models outperform the regressions as well as the neural networks. One reason for this might be that the mixture regression model can display the different modes better than the other models.

Practical consequences from the best models
In the following, we compare the three best models: the mixture regression model M2C, the neural network including the Crisis Probability and the linear regression model with separate subsets for crisis and non-crisis.
We investigate the difference $d_i$ between the predicted RRs $\hat{R}_i$ and the observed RRs $R_i^{\mathrm{obs}}$: $d_i = \hat{R}_i - R_i^{\mathrm{obs}}$, for $i = 1, \dots, n$, where $n$ is the number of observations. From the risk manager's point of view, a situation in which the RR is conservatively underestimated is favorable compared to a situation in which the RR is overestimated.
Therefore, we are interested in the number of observations where the difference between the predicted RR and the observed RR exceeds a certain threshold θ ∈ {0.1, ..., 0.9}, relative to the overall number of observations, i.e. the share $\frac{1}{n}\,\#\{i : d_i > \theta\}$. In-sample, the mixture regression model M2C overestimates the true RR by more than θ = 0.1 in 14% of all cases, whereas the linear regression model as well as the neural network overestimate the RR in as many as 29% of all observations. Having a look at θ = 0.2, we recognize that the mixture regression model M2C only overestimates the observed RR in 3.9% of all cases. The results of the linear regression as well as the neural network are worse, since in 25% of all cases the predicted RR exceeds the true RR by more than θ = 0.2. In addition, there is no observation where the predicted RR of the mixture regression model exceeds the true RR by more than θ = 0.6.
For example, if we estimate an RR of 1, the true value is at least 0.4. Thus, for a case where a full recovery is predicted, we know that at least 40% of the exposure at default will be recovered.
In case of the linear regression model as well as the neural network, there are cases where the predicted RR overestimates the true RR by more than θ = 0.9. We refer to the same example as above: if the estimated RR is 1, the true RR can be smaller than 0.1, which is almost a total loss even though the model predicts a full recovery. Moreover, we notice that the behaviour of the linear regression model and the neural network is similar.
The out-of-sample results are similar to the in-sample results. The behaviour of the linear regression model again resembles that of the neural network. Moreover, regarding the maximum difference, the out-of-sample results for the mixture regression model M2C are even slightly better than in-sample, since there is no prediction which overestimates the true RR by more than θ = 0.5.
Similar to the in-sample results, the mixture regression model overestimates the true RR by more than θ = 0.1 in 14% of all cases, whereas the linear regression model as well as the neural network exceed the observed RR by more than this threshold in 30% of all observations. In addition, the true RR is overestimated by more than θ = 0.2 in only 4.1% of all observations in case of the mixture regression model and in 26% of all observations in case of the other two models. Since the maximum difference is smaller in case of the mixture regression model and the predicted RR exceeds the observed RR by more than 0.1 resp. 0.2 in only 14% resp. 4.1% of all cases instead of 30% resp. 26% of all cases, we conclude that the mixture regression model M2C outperforms the neural network as well as the linear regression model.
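The overestimation shares discussed in this section can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `overestimation_share` and the small `predicted` and `observed` lists are hypothetical and serve only as an illustration; they are not the GCD data.

```python
def overestimation_share(predicted, observed, theta):
    """Share of observations whose predicted RR exceeds the observed RR by more than theta."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one observation is required")
    # Difference between predicted and observed RR for every observation.
    diffs = [p - o for p, o in zip(predicted, observed)]
    return sum(d > theta for d in diffs) / len(diffs)


# Hypothetical predictions and observations (illustration only).
predicted = [1.0, 0.8, 0.5, 0.9, 0.3]
observed = [0.95, 0.45, 0.6, 0.05, 0.3]

for k in range(1, 10):
    theta = k / 10
    share = overestimation_share(predicted, observed, theta)
    print(f"theta = {theta:.1f}: share = {share:.2f}")
```

Only positive differences count here, which reflects the asymmetric view of the risk manager: an underestimated RR is conservative and therefore does not enter the share.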

Summary and Conclusion
We compared different models to predict the RR, namely regression methods, decision trees, neural networks and mixture regression models. Additionally, we investigated how information on an economic crisis can be introduced into the models.
For our analysis, we considered a dataset of US-based SMEs obtained from GCD. We use the definition of the workout RR. Empirical RRs exhibit a multimodal structure with three modes at 0, 0.5 and 1. Since earlier studies in the literature point out that an economic crisis during the time to resolution has an impact on the RR, we use a predictive Crisis Indicator (resp. Crisis Probability).
The best models, in-sample as well as out-of-sample, are the mixture regression models, especially the concomitant variable model which regresses the density as well as the probability that an observation belongs to a certain cluster. We find by model selection with the BIC that including the Crisis Probability is preferable to including the Crisis Indicator. The neural network outperforms the linear regression model in-sample, but the results are similar out-of-sample. The quantile regression models lead to higher MSEs than the linear regression models. Decision trees performed worst in our study.
To conclude, let us propose some areas for future research in predicting and modelling the RR. In the present study, the RR can take values greater than one as well as smaller than zero, and we conclude that the mixture regression models outperform the other models. Since most studies consider observations with RR in the unit interval, the question arises whether the mixture regression models also outperform the other methods in such a restricted dataset.
In addition, there are some parameters which can be modified. For the regression models, we do not consider interactions yet. Moreover, the number of hidden layers in the neural networks can be adjusted. Additionally, different activation and error functions can be tested.
extend this mixture model on the Moody's Ultimate Recovery Database by introducing a Markov switching model with two states, representing crisis and non-crisis periods to capture cyclical aspects. For each state, there is a mixture model with four components for the transformed RR which enables the determination of the influence of covariates. Similar to the decision trees, Calabrese (2012) presents a mixed continuous-discrete model. In her further work, Calabrese (2014) extends her model by introducing a mixture model.
For classification problems with $C$ classes, there are $K = C$ output neurons, each representing one category. The hidden layer with neurons $H_1, \dots, H_M$ lies in between and cannot be observed directly. A bias can be added to the input and hidden layers as an extra neuron $B_I$ resp. $B_H$.
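The structure described above can be sketched as a single forward pass. The logistic activation in the hidden layer, the softmax in the output layer, the function name `forward` and all weights below are assumptions made for this illustration only; they are not the configuration estimated in this study.

```python
import math


def forward(x, w_hidden, w_output):
    """One forward pass through a network with one hidden layer and bias neurons.

    w_hidden[j] holds the weights of hidden neuron H_j: one per input plus a
    trailing weight for the input-layer bias B_I. w_output[k] holds the weights
    of output neuron k: one per hidden neuron plus a trailing weight for B_H.
    """
    x_b = list(x) + [1.0]  # append the bias neuron B_I
    # Assumed logistic activation in the hidden layer.
    hidden = [
        1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, x_b))))
        for row in w_hidden
    ]
    h_b = hidden + [1.0]  # append the bias neuron B_H
    scores = [sum(w * v for w, v in zip(row, h_b)) for row in w_output]
    # Assumed softmax output, shifted by the maximum for numerical stability.
    top = max(scores)
    exp_scores = [math.exp(s - top) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


# Hypothetical weights: 2 inputs, M = 3 hidden neurons, K = C = 2 classes.
w_hidden = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0], [0.2, 0.2, -0.1]]
w_output = [[1.0, -1.0, 0.5, 0.0], [-0.5, 0.7, 0.2, 0.1]]

probs = forward([0.4, 0.9], w_hidden, w_output)
print(probs)  # one probability per class
```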

Fig. 3: Histogram of the recovery rate in our data (SME, US-based).

Table 1: Estimated logistic regression model for the Crisis Indicator.

... any stock index, but for the focus on recovery rates of SMEs in the US, we chose the S&P 500. With the Ernst et al. (2009) algorithm, a daily Crisis Indicator is determined. To get a monthly aggregated Crisis Indicator, we apply the following decision rule: If at least 2 days within a month are indicated as crisis, the month in total is considered as crisis. In the next step, a predictive Crisis Indicator needs to be built. For every month m, we consider the period of the next 18 months [m + 1, ..., m + 1 + 18]. If there is at least one month in crisis, the predictive Crisis Indicator for m is set to 1 (indicating a crisis). Up to this point, the calculations are made on historical data and the predictive Crisis Indicator can only be obtained once the data for the next 18 months is available. Since the goal of this study is to predict the RR at the date of default, the required information is not yet available. Therefore, the predictive Crisis Indicator has to be modeled. For this, we set up a logistic regression model with macroeconomic data; Table 1 shows the included attributes and their impact.
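The two aggregation steps can be sketched as follows. This is only an illustration under stated assumptions: the daily flags are taken as given (the Ernst et al. (2009) algorithm itself is not reproduced), the function names are hypothetical, and the look-ahead window is exposed as a parameter `horizon` covering the `horizon` months that follow month m, since the exact endpoint convention of the window is an assumption here.

```python
def monthly_indicator(daily_flags_by_month, min_days=2):
    """Flag a month as crisis (1) if at least min_days of its days are flagged as crisis."""
    return [1 if sum(days) >= min_days else 0 for days in daily_flags_by_month]


def predictive_indicator(monthly, horizon=18):
    """For month m, return 1 if any of the following horizon months is in crisis.

    Months without a complete look-ahead window get None, because the
    indicator can only be computed once the future data is available.
    """
    result = []
    for m in range(len(monthly)):
        window = monthly[m + 1 : m + 1 + horizon]
        if len(window) < horizon:
            result.append(None)
        else:
            result.append(1 if any(window) else 0)
    return result


# Hypothetical daily crisis flags for six very short months (illustration only).
daily = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]

monthly = monthly_indicator(daily)
predictive = predictive_indicator(monthly, horizon=2)
print(monthly)
print(predictive)
```

The `None` entries at the end of the series correspond to the months for which the future is not yet observed, which is the reason why the predictive Crisis Indicator has to be modeled at the date of default.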

Table 2: Included attributes in the linear regression models (LR) on the crisis/non-crisis subset resp. the entire subset including the Crisis Indicator and in the quantile regression model (QR) including the Crisis Indicator.

... Collateral Indicator has an impact on the RR, whereas the Primary Industry Code as well as the Utilization Rate only have an impact in the non-crisis case. However, the linear regression model with the Crisis Indicator includes all these attributes and additionally the variable Nature ...

Table 4: Summary of the estimated mixture regression models.

Table 4 displays the results of the estimated models. Firstly, we consider model M1. The first component of M1 is mainly determined by the intercept near one, since the remaining coefficients ...

Table 5: Summary of the estimated concomitant regression models.

Table 6: In-sample and out-of-sample results for the estimated linear regression (LR) and quantile regression (QR) models, decision trees (DT) with linear or quantile regression in the unit interval (LR/QR) and logistic regression or neural network for the classification problems (LogReg or NN), neural networks (NN) and mixture regression models.

Table 7: In-sample results for the difference between the estimated and observed RR.

The results are presented in Table 7 for the in-sample and in Table 8 for the out-of-sample data.

Table 8: Out-of-sample results for the difference between the estimated and observed RR.