Meta-Learning Approaches for Recovery Rate Prediction

: While previous academic research highlights the potential of machine learning and big data for predicting corporate bond recovery rates, the operations management challenge is to identify the relevant predictive variables and the appropriate model. In this paper, we use meta-learning to combine the predictions from 20 candidates of linear, nonlinear and rule-based algorithms, and we exploit a data set of predictors including security-speciﬁc factors, macro-ﬁnancial indicators and measures of economic uncertainty. We ﬁnd that the most promising approach consists of model combinations trained on security-speciﬁc characteristics and a limited number of well-identiﬁed, theoretically sound recovery rate determinants, including uncertainty measures. Our research provides useful indications for practitioners and regulators targeting more reliable risk measures in designing micro- and macro-prudential policies.


Introduction
Credit risk is the main concern for financial institutions. It is governed by three factors: the exposure at default (EAD), the default likelihood (that is, the probability of default, PD) and the loss given default (LGD). The latter can also be expressed as (1 − RR), where the recovery rate RR represents the percentage of EAD that can be recovered in the event of default of the reference entity. For a long time, financial institutions focused on the modeling of PDs. Ratings, for example, essentially quantify the risk of a default event, the loss severity receiving little attention. However, EAD and RR are key drivers of credit risk. For instance, they act as scaling constants when computing banks' regulatory capital requirements (BCBS 2017;Loterman et al. 2012;Zhang and Thomas 2012) in Basel II (BCBS 2006) and Basel III frameworks (BCBS 2011) that banks can estimate using internal models in the advanced (A-IRB) approach. But RR also influences the value of nonperforming loans (Bellotti et al. 2021), corporate bonds and credit derivatives (Andersen and Sidenius 2004;Berd 2005;Gambetti et al. 2018;Pykthin 2003), as well as the price of mainstream OTC derivatives' products such as equity options and interest rates swaps due to counterparty risk (Gregory 2012). This explains why recovery rates modeling and forecasting receive more and more attention in current times.
Academic research highlights how nonlinear techniques (Loterman et al. 2012;Qi and Zhao 2011;Tobback et al. 2014;Yao et al. 2015) and ensemble methods (Bastos 2014;Bellotti et al. 2021;Hartmann-Wendels et al. 2014;Nazemi et al. 2017) should be preferred to the traditional parametric regressions used in earlier studies on recovery rate determinants. This finding is not confined only to RR forecasting, but it is also in credit scoring Wang et al. 2022). However, the possibility of combining predictions from a large number of different algorithms remains widely unexplored. This is surprising given that it is now well recognized that combining models might be preferable (Atiya 2020) to selecting a single model. In this paper, we confirm this conclusion by showing that meta-learning techniques exhibit better forecasting performance and lower model risk than methods in previous research when dealing with RR. 1 Meta-learning can be defined as the act of a) selecting and b) optimally combining the outputs of multiple learning machines (first-level learners), according to some combination scheme that is learned from the data (meta-learner or second-level learner) (Santos et al. 2017). Meta-learning is very flexible: the models to be combined can be represented by expert judgment, individual learners (e.g., lasso, MARS) or ensemble learners (e.g., random forests, boosted trees), whose predictions themselves are already combinations of the outputs of different models. Furthermore, the models to combine can also be trained with different predictors. This contrasts with homogeneous ensemble methods in which, despite the model combinations that are involved, they generally combine weak learners 2 of the same type, e.g., regression trees are combined into a random forest.
The main advantage of meta-learning is, hence, the ability to exploit a wide spectrum of functional forms. Its promising applicability for recovery rate prediction was first reported in Nazemi et al. (2017) using a fuzzy decision fusion approach. However, several other alternatives are worth exploring. For example, Roccazzella et al. (2021) propose a robust and sparse combination method that mitigates estimation uncertainty and implicitly features forecast selection in a single step. This approach has the additional advantage of relying on a closed-form solution for the estimation of the optimal combination strategy.
The quality of the forecasts depends also on the predictive variables we consider. Prior studies on defaulted bonds highlight the importance of security-specific characteristics (Altman and Kishore 1996;Bris et al. 2006;Jankowitsch et al. 2014;Schuermann 2004), economic conditions and the credit cycle (Acharya et al. 2007;Altman et al. 2005;Betz et al. 2018Betz et al. , 2020Bruche and González-Aguado 2010). 3 Models based on big data and variable selection techniques have been proposed by Nazemi et al. (2017 and . Gambetti et al. (2019) show that economic uncertainty is the most important systematic determinant of recovery rate distributions. However, the latter study is based on an ex post analysis. Given that uncertainty shapes the economic outlook (ECB 2009(ECB , 2016Gieseck and Largent 2016;Kose and Terrones 2012) and its proxies are particularly capable of anticipating economic downturns (Ludvigson et al. 2019), it remains to be explored whether uncertainty proxies can also improve predictive performance.
In this paper, we extend the literature on recovery rate modeling studying the forecasting performance of linear, nonlinear, rule-based and meta-learning algorithms across different specifications of the predictors set.
We start our analysis by studying whether a larger set of macro-financial indicators, uncertainty measures and idiosyncratic features in bond recovery rate forecasting models offers a significant advantage compared to a more parsimonious but economically grounded framework. Specifically, we extend the set of macroeconomic predictors used in Nazemi et al. (2017) and  with 55 pricing factors and industry portfolio returns. We also enlarge the spectrum of uncertainty proxies considered by Gambetti et al. (2019) with 11 additional measures of economic uncertainty derived from text-analysis techniques. This offers two advantages. First, as Jurado et al. (2015) point out, text-based indexes can show significant independent variations from other popular uncertainty proxies, suggesting that much of the variation in these proxies is not driven by uncertainty itself. Second, textual data is more granular, allowing for the identification of the categories of uncertainty, e.g. government spending uncertainty or monetary policy uncertainty, that better predict recovery rates. In contrast to previous studies Nazemi et al. 2017, we find that more parsimonious models that rely on well-documented recovery rate determinants outperform those making use of the entire set of predictors and those relying on data-driven variable selection. A limited number of economically grounded predictors also makes the model easier to implement and manage. Among those, uncertainty measures are particularly relevant for recovery rate prediction. Moreover, it makes its results more transparent, hence, more prone to regulatory validation. This is consistent with the regulatory requirements of IRB approaches (BCBS 2006(BCBS , 2017. Second, we provide the largest benchmark study of machine learning methods in the context of bond recovery rate prediction. We consider a total of 20 predictive algorithms, and we obtain similar conclusions to those of Bellotti et al. (2021) in the case of recovery rates for nonperforming loans: random forests, quantile random forests, boosted trees and cubist display the most promising performance and seem to be the best class of models. Bagged MARS and model-averaged neural networks also seem to be competitive.
Third, we empirically show that meta-learning can be used to improve recovery rate predictions compared to traditional machine learning methods while considerably reducing model risk. This evidence is preserved both when looking at the predictive performance within a chosen predictor set and when jointly considering predictions across all specifications of the predictor set. The proposed methods outperform competitive combining methods such as the equally weighted forecast and the hill-climbing algorithm of Caruana et al. (2004), which showed promising results in the field of credit risk scoring (for more details, see Lessmann et al. 2015). Furthermore, despite the main concern of this paper, which is recovery rate predictions, the idea of combining heterogeneous models trained using diverse predictors sets is not specific to credit risk modeling, and it can be used to hedge model uncertainty across the whole field of management science.
The remainder of the paper is structured as follows. Section 2 contains an overview of the machine learning algorithms involved in our meta-learning approach and explains the latter. Section 3 provides a description of the data. Section 4 describes the main results of our benchmark study. Section 5 discusses the practical implications implied by our results. Section 6 concludes the paper.

Methodology
An overview of our predictive strategy can be found in Figure 1. After specifying the bond recovery rate predictors, first-level learners are trained to minimize the mean square error (MSE). Subsequently, the fitted residuals of the first-level learners (meta-data) become the input of second-level learners (meta-learners), which combine the original models with the goal of making the aggregate forecast error variance as small as possible. Finally, we evaluate the predictive performance of the various classes of linear, nonlinear, rule-based and meta-learning methods. We briefly review the predictive algorithms and combination strategies that will serve as first-and second-level learners, respectively.

First-Level Learners
To undertake an unbiased benchmark study, the spectrum of competing models should be as rich as possible. This requirement has rarely been respected in the context of recovery rate modeling, except for the large-scale benchmark studies of Loterman et al. (2012) and Bellotti et al. (2021).
For comparison purposes, we used a similar set of techniques as Bellotti et al. (2021), who employ 20 algorithms belonging to three different classes: linear, nonlinear and rule-based.
We provide the list in Table 1 together with the corresponding R implementations. For more details on nonlinear and rule-based methods, we refer to Appendix A. 4 We consider seven linear models with and without penalization. Following Bellotti et al. (2021), they can be framed as the following minimization problem: where y denotes the vector of the observations in the sample, X is the N-by-p matrix of predictors and β stands for the vector of regression coefficients. Different specifications of the penalty term λ ≥ 0 and the mixing factor 0 ≤ α ≤ 1 yield the standard linear regression model (with and without backward selection), ridge, lasso and elastic net. Models of these types can only reproduce linear relationships such as those reproduced in Figure 2.   Examples of fitted relationships for linear, nonlinear and rule-based models. The true data generating process (dots) follows a shifted sine wave with Gaussian noise. Panel a represents the fit obtained by a linear regression. Panel b represents the fit obtained by support vector regression. Panel c represents the fit obtained by boosted trees with stochastic gradient boosting.

Nonlinear Models
We dealt with six types of nonlinear models. Among the kernel methods, we considered support vector regression (SVR), relevance vector machines (RVM) regression and Gaussian processes. We further considered multivariate adaptive regression splines (MARS) and two nonlinear ensembles: model-averaged neural networks and bagged MARS. We refer the reader to Bellotti et al. (2021) for an extended treatment of each of them. These models are naturally suited to capture nonlinear relationships, but they can be prone to overfit. An example of nonlinear fit is included in Figure 2.

Rule-Based Models
According to Gambetti et al. (2019), rule-based methods are able to identify clusters of data with similar properties and to reproduce step-like relationships (with or without slopes depending on the particular algorithm). We included seven types of rule-based methods. Individual models include regression trees and conditional inference trees, while ensembles are represented by cubist, random forests, quantile random forests and boosted trees with and without stochastic gradient boosting. An example of rule-based model fit is included in Figure 2.

Second-Level Learners
Starting from m predictions y 1 , . . . , y m of the random variable y returned by m firstlevel learners, two strategies can be adopted. The standard procedure consists of selecting the best forecast, say y := y i , where model i is selected according to some criteria. Alternatively, these predictions can be aggregated to form a single prediction y := φ( y 1 , . . . , y m ) using some function φ. Model combination offers diversification gains that makes it attractive when we cannot identify ex ante the best single model. 5 In addition, in the rare cases where the best model can be identified, meta-learning techniques could still take full advantage of the available information when the first-level learners rely on various data sources or cover a wide spectrum of modeling assumptions.
Meta-learning learns the combination strategy φ directly from the data with the explicit goal of minimizing a loss function. Precisely, let y be an n-dimensional column vector containing n observations of the target variable and Y be an n-by-m matrix of m unbiased candidate forecasts of y. The optimal combination strategy consists of estimating the function φ( Y) that solves where Φ := R n × R m → R n is some conformable functional space. Nevertheless, its success depends on how accurately the combination strategy can be determined. The use of in-sample data both to train the individual models and to consecutively combine them on the basis of their respective fitted residuals can significantly amplify the initial estimation error and consequently produce poor out-of-sample predictions.
Meta-learning relies on validation techniques to assess how well the combination weights will generalize to the out-of-sample predictive exercise. We opt to combine by minimizing the variance of the aggregate prediction error and include robust combination methods to hedge against potential instability. 6

Linear Meta-Learners
Restricting ourselves to the class of linear combinations with weight summing to one, i.e., for Φ being the set of functions from R m × R n → R of the form φ(x 1 , . . . , x m ) := ∑ m i=1 w i x i satisfying ∑ m i=1 w i = 1 leads to a combined prediction of the form y = ∑ m i=1 w i y i . In this case, the optimization problem becomes Constraining the weights to sum to 1 keeps the aggregate prediction unbiased provided that the candidate forecasts are also unbiased (which is the case in this paper). By additionally constraining the weights to be nonnegative, this approach corresponds to Breiman's stacked regressions (Breiman 1996), where linear combinations of different predictors are considered to further improve prediction accuracy. Granger and Ramanathan (1984) show that this approach is equivalent to selecting the weights w * that minimize the variance of the meta-learner's prediction error.
Nevertheless, this combination strategy can lack robustness when the covariance matrix of prediction errors is poorly estimated. This often occurs because of sample size limitations, considerable background noise or when first-level forecasts are highly collinear (Claeskens et al. 2016). We can tackle the potential instability of w * by using the linear shrinkage estimator of the covariance matrix of prediction errors, which we denote by Σ λ . The latter consists of combining the sample covariance matrix (which is easy to compute, asymptotically unbiased but prone to estimation errors) with an estimator that is misspecified and biased but more robust to estimation errors (Ledoit and Wolf 2004). This approach, initially derived in a portfolio optimization context, was recently transposed in Roccazzella et al. (2021) to the forecast combination problem, showing that constrained optimization with shrinkage (COS) of Σ can provide a single-step, fast and robust optimal forecast combination strategy. Here we adapt the COS to act as a linear meta-learner, which combines the first-level learners only on the basis of in-sample information. This leads to the following optimization problem: where λ ∈ [0, 1] is the shrinkage intensity, Σ is the sample covariance matrix of prediction errors and Σ is a predetermined reference covariance matrix. We estimate the optimal shrinkage intensity by minimizing the expected Frobenius norm of the difference between Σ λ and the population covariance matrix of prediction errors S (Ledoit and Wolf 2004). 7 We consider two shrinkage directions for Σ.

Definition 1 (COS-E-Constrained
Optimization with Shrinkage towards Equal weights). The target covariance matrix Σ COS−E corresponds to the case where first-level learners are assumed to have identical prediction error varianceσ 2 and identical pairwise correlation coefficientsρ: In practice, we setσ 2 = 1 m ∑ m i=1 σ 2 i , where σ 2 i is the sample prediction error variance of model i (the average error prediction variance in the set of first-level learners) and byρ = 2 m(m−1) ∑ m−1 i=1 ∑ m j=i+1 ρ i,j (the average correlation coefficient of first learners' in-sample prediction error).
Definition 2 (COS-IL-Constrained Optimization with Shrinkage towards Inverse Loss-based weights). The target covariance matrix Σ COS−IL corresponds to the case where first-level learners have an identical pairwise correlation coefficientρ, but their respective prediction error variance is estimated using sample data: As benchmarks, we also include the equally weighted forecast, i.e., φ(x 1 , . . . , 1 m x i and the hill-climbing algorithm of Caruana et al. (2004). In this latter case, forward step-wise selection is used to identify the equally weighted combination that minimizes the root mean square error (RMSE) in a validation fold sampled from the training data.

Nonlinear Meta-Learners
Thus far, we have described linear weighting schemes. Nevertheless, meta-learning is more general. For example, in the field of image classification, meta-learning techniques typically involve nonlinear methods such as deep learning or recurrent models (Santoro et al. 2016), metric learning (Koch et al. 2015) and learning optimizers (Ravi and Larochelle 2017). Despite being more flexible than linear methods, nonlinear meta-learners are also more complex and prone to overfit, especially in a relatively small data set. 8 For these reasons, we opt for ensembles of shallow (one hidden layer) feed-forward neural networks (NNs), which can still approximate any measurable function at any desired degree of accuracy provided that sufficiently many hidden units are available (Hornik et al. 1989). In this case, we consider Φ to be the set of functions from where k is the number of neurons in the hidden layer.
As for the linear combination schemes, we train the NNs to minimize the in-sample MSE using, as input, the first-level predictions. However, in this case, the output of the meta-learning is not a linear function of the inputs, and the combining weights are not constrained to sum to one. Therefore, studying the marginal contribution of each first-level prediction onto the aggregate prediction is not straightforward. In our empirical exercise, we consider two feed-forward neural networks with 1 (NN-1) and 2 (NN-2) hidden layers and 8 and 16-8 neurons, respectively.

Recovery Rates and Security-Specific Characteristics
The main references on bond recovery rate modeling are generally based on the Moody's Default & Recovery Database (Moody's DRD) or the Standard & Poor's Capital IQ Database. We employ Moody's DRD in this paper.
Following the standard market convention (Mora 2015;Schuermann 2004), the recovery rate of each bond is computed as the bond price measured 30 days after the default date, declared by the rating agency, and divided by the face value. To make our analyses more comparable with earlier studies such as Nazemi et al. (2017),  and , we apply a similar filtering strategy. We selected dollar-denominated bonds issued by U.S. companies and with at least USD 5 million in face value. To replicate the same economic conditions, we also filter for default issues in the period 2002-2012. We only retain the observations associated with known values for the following security-specific characteristics: debt seniority, issuer's industrial sector, default type, coupon level, maturity, presence of additional guarantees different from the issuer's asset and default date. We are thus left with 768 observations. Notice that machine learning methods provide useful insights even with a sample of this size. In particular, the considered predictive models mitigate the risk of overfitting, can identify the most relevant predictive variables and explore the existence of potential nonlinearities in the data (nonlinear and rule-based methods). Figure 3 includes a histogram of our recovery rate sample. We provide the corresponding summary statistics in Table 2. The average recovery rate is 30.98%, while the standard deviation is equal to 27.58, which is a large value compared to the mean. The distribution is also highly skewed, a typical feature of recovery rate data. Summary statistics of recovery rates conditional on the seniority of the defaulted bond, issuer's industrial sector and default type are provided in Tables 3-5, respectively. They are consistent with previous findings on recovery rate determinants. For instance, bonds with higher seniority are associated with higher recovery rates on average, and those of senior secured bonds are the most dispersed (Altman and Kalotay 2014;Altman and Kishore 1996). Recovery rates are also higher, on average, when the issuer is operating in an industrial sector featuring higher asset tangibility and in the utility sector in particular (Acharya et al. 2007;Altman and Kishore 1996;Schuermann 2004). Similarly, milder default procedures lead to higher recovery rates (Altman and Karlin 2009;Bris et al. 2006;Davydenko and Franks 2008;Franks and Torous 1994). Defaults on security cash flows are expected to recover more than company reorganizations or liquidations. Controlled reorganizations (prepackaged Chapter 11) also display higher recovery than uncontrolled reorganizations and asset liquidations (receivership and procedures included in the "others" category). Tables 6-8 show that our sample of recovery rates is also consistent with prior findings about the role of coupons, maturity and backing guarantees. Recovery rates are higher in the presence of backing guarantees, increase with coupons and decrease with maturity (Jankowitsch et al. 2014). All these security-specific characteristics have been used as control variables in recent articles on bond recovery rate determinants (François 2019;Gambetti et al. 2019) and are used as predictors in our study.  Table 8. Average recovery rate for bonds with and without backing.

Systematic Factors
We extract industry default rates from Moody's DRD and the remaining systematic factors from the databases managed by the Federal Reserve Bank of St. Louis: FRED and FRED-MD (McCracken and Ng 2015). The latter include a large set of time series referring to output and income, labor market, housing, consumption, orders and inventories, money and credit, interest and exchange rates, prices and the stock market. 9 We then extend this data set with a large number of economic uncertainty measures from three different classes: survey-based, news-based and volatility-based. 10 All measures are retrieved from the websites of the authors (Baker et al. 2016;Jurado et al. 2015;Ludvigson et al. 2019). We consider all five measures employed in the study by Gambetti et al. (2019) plus 11 additional measures of categorical economic policy uncertainty. Table 9 gives an overview. Factors relating to the market price of risk and industry portfolio returns are instead retrieved from the Fama-French database. Predictors are measured one month before the default date. We now proceed to present the results (Section 4) and to discuss their relevance with respect to practical implementation in the industry (Section 5).

Results
We compare the performance of the considered algorithms across nine specifications of the predictor set. All model specifications feature security-specific characteristics but differ in the systematic variables that are included. We use two approaches to determine the latter.
(a) Statistical approach: the algorithms can access either the full data set of systematic variables or a set created with variable selection techniques. For variable selection, we consider a model based on lasso-selected systematic variables and one based on lasso with stability selection (Meinshausen and Bühlmann 2010). While the latter has been used in  to check the reliability of their lasso-selected macroeconomic variables, those retained by lasso with stability selection have never been used to feed predictive algorithms. 11 (b) Economic approach: we create models by relying exclusively on well-identified factors based on the results of Gambetti et al. (2019) and prior studies on recovery rate determinants. Table 10 includes a summary of the model specifications.
We compare the considered predictive strategies by analyzing the root mean square forecast error (RMFE), and we also assess the significance of a performance difference via a model confidence set test (hereafter MCS, Hansen et al. 2011). 12

Predictive Models vs. Historical Averages
The performance of our predictive algorithms across model specifications is summarized in Figure 4; the detailed out-of-sample RMSE are reported in Table 11. Each line corresponds to a different algorithm. The solid horizontal line (in red) corresponds to the model where we use the in-sample mean recovery rate as a prediction; it can be assimilated to the regulatory standard approach and foundation IRB approach (BCBS 2006(BCBS , 2011, which are largely based on historical LGD values. Figure 4 highlights that forecasting recovery rates using an algorithm always leads to more precise estimates than using the average of previously observed values. The only exception is the linear regression, which, as expected, suffers from the curse of dimensionality when trained on the full set of predictors (specification 1). Table 11 and Figure 4 also make clear that, in line with the literature, models exploiting systematic variables (specifications 1 to 8) always yield better forecasting performance than a model based on security-specific factors alone (specification 9).

Models Based on Systematic Variables
For the selection of macroeconomic variables with data-driven methods, we observe from Table 11 that lasso selection seems to bring more benefit to linear models than to nonlinear or rule-based models. In particular, it appears that ensemble methods (neural networks, bagged MARS, boosted trees, random forests, quantile random forests, cubist) are deprived of useful predictive information when they are trained on lasso-selected variables (specification 2). The performance of these latter algorithms improves if we instead implement the lasso procedure with stability selection (specification 3). We obtain particularly competitive RMSE values in this latter case; the best performance of first-level learners reaches an RMSE of 0.174 with boosted trees. From an aggregate perspective, it is clear that models relying on lasso selection (with or without stability control) reduce the embedded model risk compared to using the complete set of predictors.
We find that lasso with stability selection retains financial uncertainty, uncertainty related to government spending and uncertainty related to sovereign debt and currency crises (Table 12). Financial uncertainty is associated with the highest probability of being selected (95%). The second and third factors, an index of consumption expenditures and a housing market indicator, are associated with probabilities of 88% and 86%, respectively. 13 Given these results, we confirm the importance of uncertainty measures for forecasting bond recovery rates.   Moreover, if we consider the features of data-driven selection methods against those of economically motivated models, we should prioritize the use of the latter. Notwithstanding that both methods exhibit comparable predictive performance, the economically motivated models offer several advantages. By being based on a low-dimensional set of well-identified economic factors, they are easier to implement and monitor and yield more interpretable results. In contrast, data-driven selection methods can fail to correctly select the best subset of predictors when the latter feature high correlation and require the specification of additional hyperparameters. This increases the complexity and the computational burden, with no guarantee of perfectly selecting the full set of significant predictors.

First-Level Learners
We now discuss the findings regarding the out-of-sample performance of our algorithms across model structures. In Figure 4, we highlight (in black) the performance of ensemble methods: model-averaged NNs, bagged MARS, boosted trees (with and without stochastic gradient boosting), cubist, random forests and quantile random forests.
In this respect, our results point in the same direction as what Bellotti et al. (2021) discover for nonperforming loans. Rule-based ensemble methods are always associated with the most promising performance compared to both individual learners (linear or nonlinear) and the two other ensembles (bagged MARS and model-averaged NNs). Nonetheless, even if they are less competitive than rule-based ensembles, bagged MARS and model-averaged neural networks outperform individual learners in almost all model specifications.
This can be explained as follows. First, the prediction errors of ensemble methods generally have a lower variance than those of individual learners. This effect is actually a consequence of the aggregation of several base learners; it is particularly visible when comparing the performance of MARS against that of its bagged version. Second, rule-based ensembles are better suited to capture subgroups of data with similar properties and to build a separate model for each group. Recovery rates are particularly prone to behave in this manner, as explained in Yao et al. (2015),  and Bellotti et al. (2021).
Third, rule-based methods are better suited to reproduce predictor-response relationships that are defined on a closed interval, as is generally the case for recovery rates.
It appears that the only individual learner that can compete with rule-based ensembles is the SVR. However, this method yields good performance only when trained using the full set of predictors or when systematic predictors are selected using lasso without stability selection (which corresponds to the framework adopted in Nazemi and Fabozzi (2018)).

Meta-Learning: Within and across Predictor Sets
Let us now analyze the performance of linear meta-learning algorithms within each specification of the predictors set (highlighted in Figure 4 and Table 11). We notice a sharp drop in the average RMSE metrics for almost all specifications compared to the other models. In fact, the architecture of these algorithms allows the creation of more flexible functional forms thanks to the combination of various first-level learners. The different strengths of first-level learners are merged together to yield more accurate forecasts. The performance of linear meta-learning algorithms can be compared to those of the best rulebased ensembles. Indeed, Table 11 shows that OPT+, COS-E and COS-IL always join the superior set of models with at least 10% probability outperforming the equally weighted forecast and the hill-climbing algorithm, which, nevertheless, remain competitive especially in specification 3, 4 and 5.
The variation of the aggregate errors within linear meta-learners is much lower than within the class of individual or ensemble models, and the same holds true for the maximum average loss. Therefore, model risk is sensibly reduced when relying on meta-learners compared to on both individual learners and ensembles. This feature can be observed from Figure 4 and holds across all model specifications. We also find that nonlinear metalearners (NNs) should generally be avoided. They usually display larger RMSE values than those of linear meta-learners across all specifications of the predictor set. Moreover, their partial advantage over ensemble methods and first-level learners in some specifications is neutralized by considerably higher errors in others (i.e., specifications 1, 8 and 9). Hence, in terms of model risk, ensembles and linear meta-learners should be preferred. 14 Nevertheless, in practice, we do not know ex ante which among the nine specifications of the predictors set will offer the best performance. While we have previously employed meta-learning to learn the predictive setup within each specification, we now use the same tools to combine all models across all specifications. In other words, we now jointly consider 180 base learners (20 models times 9 predictors specifications). Table 13 displays the RMSE, mean absolute error (MAE) and R 2 of the best 20 predictive strategies in this exercise. Boosted trees (specification 2, 4, 6, 7), quantile random forest (specification 3) and random forest (specification 3) are the only base learners that join the superior set of models. However, their performance clearly depends on the specification of the predictors set. Among meta-learners, while Opt suffers from the increased dimensionality of the problem and performs poorly, meta-learners with nonnegative combining weights (Opt+, COS-E and COS-IL) occupy three of the top four places and join the superior set of models, hence mitigating the uncertainty related to the choice of both the predictors set and the modeling framework, i.e., individual vs. ensembles, linear vs. nonlinear vs. rule-based methods. The results do not significantly differ when analyzing R 2 instead of RMSE; this is not surprising given the close relationship between the two metrics. The ranking remains coherent with previous results, especially among the top-performing methods: linear metalearning techniques are still at the top, beaten only by boosted trees in specification 7, i.e., when inflation uncertainty and federal/state/local expenditures uncertainty are considered together with industry default rate, commercial and industrial loans delinquency rates, industrial production, market index returns, PMI and individual characteristics. Nevertheless, this difference in performance is not statistically significant: COS-E, COS-IL and Opt+ join the superior set of models with 10% confidence level. Conversely, equally weighted forecast and the hill-climbing algorithm perform poorly and are both excluded from the superior set of models. The main difference that emerges when considering the MAE is that quantile random forests (QRF) outperform their competitors, on average. This should not come as a surprise: QRF are trained to minimize the MAE to mitigate the impact of outliers on estimation error, while the loss functions of other predictive strategies are instead related to the RMSE. Overall, COS-E, COS-IL and Opt+ still provide very competitive MAE, while, at the same time, mitigating the uncertainty related to the choice of both the predictors set and the modeling framework. Table 13. Ranking of top 20 predictive frameworks according to out-of-sample root mean square error (RMSE) metrics across all specifications of the predictors set. The symbols *, ** and *** mark 1%, 5% and 10% significance levels of the model confidence test with L2 loss, respectively. For the sake of comparison, we also display the results corresponding to the equally weighted forecast and hill-climbing algorithm, despite ranking, respectively, 24th and 37th.

Discussion and Practical Considerations
The results of the previous subsections convey important practical considerations that we now summarize.
First, financial institutions should accelerate the implementation of LGD internal models instead of maintaining the use of historical averages. We show in Section 4.1 that the predictive model does not need to be complicated to be effective. All model specifications outperform the historical average approach, which, however, is the method underlying the standard and foundation IRB frameworks. Unfortunately, recent regulatory guidelines (BCBS 2017) are now pointing in the opposite direction, favoring fixed recovery rate approaches. Our results suggest that this is perhaps not the best route to follow, as LGD internal models can produce more reliable risk figures.
Second, in Section 4.2, we confirm the necessity of including economic factors that can capture systematic fluctuations in recovery rates. This result is in line with the current regulatory prescriptions. Models relying on well-identified economic determinants provide remarkable results, while big data and variable selection procedures do not necessarily translate into better performance because results strongly depend on the considered predictive algorithms. We found that financial uncertainty and text-based economic uncertainty measures are relevant predictors of the recovery rate ex ante. This extends the finding of Gambetti et al. (2019), where financial uncertainty and the Economic Policy Uncertainty Index were found to explain ex post recovery rates. Similarly, we also confirm the finding of : inflation measures, inflation expectation, industrial production, corporate profits after tax, indicators of housing market and stock market volatility are not only systemic determinants of recovery rates but also important predictors. This is also confirmed when using lasso with stability selection (Table 12): the probability of including financial uncertainty and text-based economic policy uncertainty measures is particularly high. Specifically, financial uncertainty, economic policy uncertainty/government spending, economic policy uncertainty/sovereign debt currency crises and economic policy uncertainty/trade policy have 95%, 50%, 40% and 29% probability of being selected, respectively. Including such predictors in more parsimonious and theoretically justified models makes the predictive setting easier to implement, interpret and backtest, making it more prone to be validated by regulators.
Third, in a context where increasing amounts of data become publicly available, the common practice of using standard linear regression to forecast recovery rates should be abandoned. In Section 4.1, we confirm that this is incompatible with the possibility of exploiting the information contained in a large data set of candidate predictors and embeds high model risk. This result is in line with the recent findings of Dong et al. (2020) in a similar financial context.
Fourth, we show in Section 4.3 that ensemble methods and meta-learning techniques should be fostered in the industry. The choice of the A-IRB model should be directed toward ensemble methods with respect to individual learners, as they always yield better performance. However, an incorrect choice among ensembles may still lead to worse results than an incorrect choice among meta-learning techniques. Hence, meta-learning considerably reduces model risk compared to ensembles; it should be considered by the regulator for a new set of A-IRB guidelines. This is particularly true for linear metalearning techniques that also have the advantage of providing interpretable insights about the contribution of each individual model. As they show, the benefits of diversification can already be appreciated by combining a limited number of high-performing algorithms.

Conclusions
In this paper, we explore for the first time the applicability and the potential of metalearning in order to forecast bond recovery rates. Meta-learning consists of combining the predictions arising from several first-level algorithms into a single aggregated forecast. More specifically, our purpose is to test the performance of a wide set of machine learning methods for predicting recovery rates on defaulted corporate bonds using a data set of predictors of unprecedented size and see whether combining them might display superior performance. We consider as first-level algorithms both individual and ensemble methods belonging to the classes of linear, nonlinear and rule-based models.
We contribute to the literature on bond recovery rate prediction in three ways. Our first contribution deals with the set of predictors to use. We find that including a limited number of well-identified recovery rate determinants and economic uncertainty yields better predictive performance than using the full set of macroeconomic predictors, even when combined with variable selection techniques such as lasso. This finding prevails in all considered algorithms and contrasts with the current big data trend. The use of a limited number of economically grounded predictors offers the additional advantages of making the model easier to implement, to backtest and therefore, when applicable, to be validated by regulatory instances because of mitigating the risk of overfitting and data mining. Interestingly, we confirm the central importance of uncertainty measures that are always retained by the considered variable selection methods.
The second and third contributions relate to the predictive method to use. Random forests, quantile random forests, boosted trees and cubist display the most promising results, but their performances are unstable across the considered specifications of the predictors set: it seems difficult to identify ex ante the right pair of model and predictors set to use. By contrast, we empirically show that meta-learning can beat this challenge. Indeed, meta-learning can improve recovery rate predictions compared to traditional individual learning machines while, at the same time, considerably reduce model risk. This evidence is preserved both when looking at the predictive performance within a chosen predictor set and when jointly considering predictions across all specifications of the predictor set.
In all specifications, the historical average approach performs significantly worse. Yet, this is the method underlying the standardized approach and foundation IRB framework used to compute regulatory capital reserves, which is the primary figure used worldwide to monitor the credit worthiness of banks. Therefore, our findings suggest that regulators and policy makers promoting the use of more reliable risk figures should foster the implementation of LGD internal models that use meta-learning techniques, while the use of traditional linear regressions or historical averages should be challenged. Eventually, our findings will be of practical interest to design new tools for internal economic capital calculations as well as for stress testing purposes.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Details on Nonlinear and Rule-Based Methods
Following Bellotti et al. (2021), we give more details on the nonlinear and rule-based methods we have considered in this study. We refer to Hastie et al. (2009) for a textbook treatment of the predictive algorithms.

Multivariate Adaptive Regression Splines (MARS)
The multivariate adaptive regression splines method of Friedman (1991) features piecewise transformations of the original predictive variables. Specifically, given a set of cut points t, each predictor X j is transformed into a set of reflected pairs obtained by: where j = (1, 2, ..., p) and t ∈ {x 1j , x 2j , ..., x Nj }. A linear regression model is then estimated following a greedy procedure on the reflected pairs that are selected for each predictor. The number of selected pairs for each predictor defines the degree of the MARS algorithm. A backward pruning procedure handles overfitting by dropping the features that are associated with the smallest error rate when excluded. In this study, we consider MARS of degree one, and the number of terms to be retained in the final model is estimated via 10-fold cross-validation.

K-Nearest Neighbors
K-nearest neighbors is a non-parametric method that forecasts the target variable by using the average of the K training observations that are closest in the covariate space. Closeness between observations is defined by the Euclidean distance. In this paper, we tune the size of the neighborhood K via 10-fold cross-validation.

Model-Averaged Neural Networks
Model-averaged neural networks (Ripley 1996) is an ensemble of feed-forward neural networks where the base learners use different initial values for the optimization procedure to estimate the parameters. In addition to reducing the model risk via averaging the multiple predictions of the base learners, the random initialization of the parameters mitigates the risks of converging in local optima when dealing with a backpropagation algorithm. Individual neural networks can also include a weight decay that penalizes large coefficients further mitigating the risk of overfitting. In this paper, we employ an ensemble of five neural networks, and we tune the number of hidden units as well as the weight decay via 10-fold cross-validation.

Support Vector Regression, Relevance Vector Machine and Gaussian Processes
Support vector regression (SVR) (Vapnik 1995) is a kernel-based model estimated from the following minimization problem: where L is an -insensitive loss function, and C is the penalty assigned to residuals greater or equal to . The solution to this problem (A2) can be expressed in terms of a set of weights w i and a positive definite kernel function K(·) that depends on the training set data points: Training observation associated to non-zero weights are called support vectors, and they are used to estimate the model. We consider the radial basis kernel K(x, x ) = exp(−σ x − x ), and we estimate the penalty C via 10-fold cross-validation while the scaling parameter is σ, following Caputo et al. (2002).
Relevance vector machine (RVM) (Tipping 2001) is a kernel method similar to SVR, but the weights are now computed using a Bayesian framework. This allows for a probabilistic interpretation of the model predictions, and it offers the additional advantage of producing sparser models than standard SVR. Williams and Rasmussen (1996) introduce Gaussian processes, a non-sparse nonparametric generalization of the RVM framework. Gaussian processes impose a prior distribution directly on the function values and consider Gaussian distribution with zero mean and covariance matrix equal to the kernel matrix K ij = K(x i , x j ). In this work, we implement Gaussian processes using a radial basis kernel.

Regression Trees
Regression trees (Breiman et al. 1984) partition the predictors' space into a set of nonoverlapping regions and fit a model in each of them. In their simplest version, predictions are formed using the average of the target variables associated with each region. The algorithm employs a top-down partitioning to estimate the regions, starting with the full dataset and sequentially splitting it into groups according to the predictors and cut points that achieve the largest decrease in the residual sum of square errors. The splitting procedure stops when a certain criterion is met, e.g., the number of observations in the terminal nodes. To limit overfitting, the tree is then pruned back by using cost and/or complexity criterion, and the amount of regularization can be tuned via cross-validation.

Conditional Inference Trees
The conditional inference trees algorithm (Hothorn et al. 2006) was proposed to overcome the potential selection bias of regression trees. In fact, in the standard regression trees algorithm, there is a greater chance of selecting the predictors featuring a higher number of candidate cut points during the tree growing step. For each predictor, conditional inference trees use statistical hypothesis testing to evaluate the difference between the means of the two samples created by a candidate split. Multiple comparison corrections reduce the selection bias for highly granular predictors (Westfall and Young 1993). In this work, we estimate the p-value threshold, determining whether to consider a new split, and the tree maximum depth via 10-fold cross-validation.

Bagged Trees and Random Forests
Bagged trees (Breiman 1996) is an ensemble method resulting from the aggregation of the predictions of multiple regression trees estimated on bootstrapped samples of the training set. In this work, we tune the number of trees composing the ensemble by 10-fold cross-validation.
Random forests (Breiman 2001) is an evolution of the bagged trees algorithm that was motivated by overcoming the problem of high correlation between individual trees. In fact, for bagged trees, all predictors are considered at each split during the tree-growing process, which results in trees grown in very similar structures. The random forests algorithm instead considers a subset of randomly selected predictors at each split. This reduces the correlation between the trees and the variance of the ensemble prediction. We tune the number of randomly selected predictors via 10-fold cross-validation.

Boosted Trees
Boosted trees is an ensemble algorithm where the individual trees are sequentially fitted by multiple weak learners, and to avoid overfitting, only a percentage of each fitted value (called learning rate) is subtracted from the residual from the previous learner. The number of boosting iterations, i.e., the number of trees, the learning rate and the individual trees' depth, are the main hyper-parameters for this model. Stochastic gradient boosting is a version of the algorithm that includes a random sampling scheme of the training data at each iteration to improve computational efficiency and further mitigate the risk of overfitting. In this work, we tune the hyper-parameters via 10-fold cross-validation.

Cubist
Cubist (Quinlan 1993) combines the M5 model tree of Quinlan (1992) with some features from boosting and K-nearest neighbors algorithms. Cubist has a tree structure where each node contains a linear regression model whose covariates are the same variables that satisfy the rule defining a specific node. After the model is estimated, the predictions in each node are recursively smoothed by using the fitted values from the respective parent node. Smoothing consists of linearly combining the two models, with the one that has the smallest RMSE having the largest weight. To limit overfitting, the rules can be pruned using the adjusted error rate criterion as in M5. Cubist can also adjust the forecast using a weighted average of sample neighbors. Committees, i.e., ensembles, can also be created using multiple model trees in a boosting-like framework. In this work, we tune the number of neighbors and the number of committees via 10-fold cross-validation.

Appendix B. A Closer Look at Uncertainty Measures
Following Gambetti et al. (2019), we provide more details on the financial and economic uncertainty measures employed in this study.
We consider two measures of financial uncertainty: (a) stock market volatility and (b) the financial uncertainty index of Jurado et al. (2015). On the one hand, the stock market volatility-based measure of financial uncertainty included in our data set is represented by the stock market implied volatility index VIX, retrieved from the Chicago Board Options Exchange database. It has monthly frequency, and it is also included in the FRED monthly database. On the other hand, we select the one-month horizon financial uncertainty indicator of Jurado et al. (2015) and Ludvigson et al. (2019) that is instead based on forecast error volatility. This latter method identifies uncertainty with unpredictability, i.e., the more or less the macroeconomic series have become predictable, the less or more uncertainty agents face. Specifically, Jurado et al. (2015) define h-period ahead uncertainty in the variable Y to be the conditional volatility of the purely unforecastable (at horizon h) component of the future value of the series. Data was retrieved from the authors' website. Uncertainty measures can also be based on text analysis. This is the case of the economic policy uncertainty index and its sub-components from Baker et al. (2016). The series, available monthly, are retrieved from https://www.policyuncertainty.com/ (accessed on 28 April 2022) and include a range of sub-indexes based solely on news data. Specifically, these are extracted from the Access World News database of over 2000 US newspapers and measure coverage frequency. For example, each sub-index measures the coverage frequency of economic, uncertainty and policy terms as well as a set of categorical policy terms, e.g., monetary policy, tax government spending or trade policy (see the news-based measures in Table 9 for the complete list), in the pool of articles under analysis. For instance, articles that fulfill the requirements to be coded as economic policy uncertainty and contain the term 'federal reserve' would be included in the monetary policy uncertainty sub-index. For more details, please refer to Appendix B of Baker et al. (2016). We also consider survey-based measures that are available at https://www.policyuncertainty.com/ (accessed on 28 April 2022). The first, drawing on reports by the Congressional Budget Office, reflects the number of federal tax code provisions set to expire over the next 10 years. The second, relying on the Federal Reserve Bank of Philadelphia's Survey of Professional Forecasters, measures the dispersion between individual forecasters' predictions about future levels of the Consumer Price Index, federal expenditures and state and local expenditures among economic forecasters as a proxy of uncertainty about policy-related macroeconomic variables.

Appendix C. Economic Data
The list of the variables considered in this study is provided below. We indicate with ∆ the differencing order and with log the natural logarithm operator. The column Tcode denotes the following data transformation for a series x: (1) no transformation; (2) ∆x t ; (3) ∆2x t ; (4) log(x t ); (5) ∆log(x t ); (6) ∆2log(x t ); (7) ∆( x t x t−1 − 1). For more details on the calculation of the variables denoted with (*), please refer to the data appendix of Jurado et al. (2015) . We define model risk from three perspectives: (i) maximum average loss across model specifications and model classes, (ii) average loss and (iii) its variability within each model class.

2
A weak learner is any machine learning algorithm that provides an accuracy slightly better than random guessing.
3 Models based on economic principles approximate the latter using industry default rates, loan delinquency rates, market and industrial production returns and recession indicators (Altman et al. 2005;Gambetti et al. 2019;Jankowitsch et al. 2014;Mora 2015). 4 Hyperparameters for first-level learners are tuned using 10-fold cross-validation in the training sample. Folds are created using stratified sampling based on seniority type, as in Nazemi et al. (2017),  and . The same applies for generating the training and test sets, with proportions of 70% and 30%. 5 Forecast selection outperforms the forecast combination only in very specific situations that are typically not encountered in practice: for instance, when the variance of the prediction errors of one model is lower than those of the others by several orders of magnitude, see, e.g., Roccazzella et al. (2021). 6 Another strategy consists of using an additional validation fold (Wolpert 1992). This has the drawback of extending the original data with potentially informative observations that would unevenly boost the performance of meta-learning techniques with respect to those of individual models and ensemble methods. In this paper, we empirically show that combining schemes that do not rely on additional sample splitting perform remarkably well compared to the a posterior best predictive framework. This is surprising especially for combination schemes whose weights are estimated using the same in sample information. 7 For further details on the COS methodology and for the explicit formula to estimate the optimal shrinkage intensity, we refer to Roccazzella et al. (2021). 8 For example, Dodge and Karam (2016) documents that deep learning methods are particularly sensitive to noise levels in image classification tasks. 9 We refer to Appendix C for the full list of the variables and the transformations performed on the raw data. 10 The reader can refer to Appendix B for more details and to Gambetti et al. (2019) for a detailed literature review on the topic. 11 We apply lasso with stability selection based on the R implementation stabs by Hofner and Hothorn (2017). We determine the dimension of bootstrapped lasso models using pointwise control (Meinshausen and Bühlmann 2010). Moreover, we specify a threshold of 0.6 for the selection probability as in . 12 The MCS tests whether a subset of methods enters jointly in the superior set of models by repeatedly testing the null hypothesis of equal predictive performance with significance level α. Let M 0 be the set of all forecasting models (both individual candidates and forecasts combinations), and let M * be the superior set of models. Formally, the MCS tests H 0 : E d i,j = 0, ∀i, j. If the null hypothesis is rejected, then the procedure eliminates the model with the greatest relative loss from the set M. This procedure is sequentially repeated until the null hypothesis is not rejected at the chosen probability level α. We compute the MCS p-values via bootstrapping (10000 replications) and using the Oxford MFE Toolbox publicly available at https://www.kevinsheppard.com/ code/matlab/mfe-toolbox/ (accessed on 28 April 2022) . 13 We find our list of selected variables to be largely consistent with those highlighted in . A table of predictor probabilities is included in the Appendix C.