1. Introduction
Based on the Basel II and Basel III Accords [1,2], banks in the EU need to keep a specified amount of capital to reduce the impact of their insolvency. The banks may choose one of the two following options: the Standardized Approach or the Internal Rating Approach. The Internal Rating Approach enables institutions to develop their own models for the purpose of calculating the required equity for credit collateralization. As part of the Basel II and Basel III Accords, three parameters were determined. The first parameter is the probability of default (PD), the next one is the loss given default (LGD) and the last one is the exposure at default (EAD). All three parameters play an important role in the measurement and management of credit risk.
In the beginning, researchers and practitioners focused only on the first parameter, the probability of default; however, in recent years, more attention has been paid to the loss given default [3]. If the LGD is not determined correctly, the bank needs to hold more capital and, at the same time, less money may be devoted to investment. The correct estimation of LGD may therefore lead to a healthier and less risky allocation of capital.
One of the most popular models for LGD modeling is ordinary least squares (OLS) [4]. Given the very specific distribution of LGD values, the beta distribution model is also considered [5]. Non-parametric methods (such as regression trees or neural networks) are likewise used to model LGD values. However, it should be noted that most research papers consider only the prediction of the mean, without considering the entire LGD distribution. Since LGD values are usually associated with full loss or full recovery, the LGD distribution has a bimodal shape and predicting the mean value may be misleading. To overcome this, quantile regression was adopted to model the LGD distribution [3,6].
However, any single model is often less accurate than an ensemble of models, and relying on one model may lead to inferior out-of-sample performance. Ensemble models play an important role in data analytics and can be created in a variety of ways. Alongside techniques such as bagging, boosting and simple voting, one can employ the Random Forest algorithm (RF). The purpose of RF is to combine the forecasts of several base models (decision trees) in order to improve reliability and generalizability over a single model [7]. There is strong evidence that a single model can be outperformed by an ensemble of models combined to reduce bias, variance or both.
For regression problems, Random Forests give an accurate approximation of the conditional mean of a response variable. It can be shown that RF provides information about the full conditional distribution of the response variable (such as LGD), not only about the conditional mean. Conditional quantiles can be inferred with quantile Regression Forests (qRF), a generalization of standard RF [8]. Quantile Regression Forests give a nonparametric and accurate way of estimating conditional quantiles for high-dimensional predictor variables.
As already mentioned, ensemble algorithms aim to combine many diverse base predictive models; however, there is always the question of how much influence each model should have on the final prediction. The simplest approach is to give equal weight to each learner. However, numerous studies have shown that a weighted ensemble can provide better forecasting results [7]. For this reason, this research presents a novel weighted quantile Regression Forest algorithm (wqRF), which assigns higher weights in quantile estimation to those base learners with better generalization abilities. To prove the effectiveness of our methodology, we compare the proposed approach with other state-of-the-art benchmarking methods, including ordinary least squares, quantile regression, and standard quantile Regression Forests.
In light of the aforementioned reasons and motivations, using the dataset collected by one of the biggest Polish banks [9], we aim to answer the following research questions:
To what extent is it possible to model and predict future values of the loss given default?
Does the proposed weighting method introduce improvements to the standard quantile Regression Forest algorithm?
Which modeling method best determines the loss given default in terms of the performance on a new unseen dataset?
The remainder of this paper is organized as follows: Section 2 provides an overview of similar research problems in relation to loss given default modeling, machine learning, the Random Forest algorithm and weighting methods. In Section 3 and Section 4, the theoretical framework of the bimodal distribution and the weighted quantile Regression Forest algorithm is presented. Section 5 outlines the experiments and presents a discussion of the results. The paper ends with concluding remarks in Section 6.
3. Bimodal Distribution
A unimodal distribution is one of the most popular assumptions used in empirical modeling. Unimodal means that the given distribution has only one mode [48]; a typical example of a unimodal distribution is the normal distribution (see Figure 1). Moreover, the normal distribution can be classified as a symmetric distribution, but in many empirical analyses the symmetry assumption is too strong and other, asymmetric distributions are utilized (for example, Snedecor's F distribution).
In empirical analyses, the data are frequently bimodal and cannot be modeled by unimodal distributions [49]. Bimodality means that a given distribution has two modes and a large proportion of observations lying far from the middle of the distribution [50]. It may provide important information about the nature of the analyzed variable (for example, the polarization of opinions if the variable represents a preference). Additionally, bimodality may indicate that the analyzed sample comes from two or more “overlapping” distributions or that the analyzed sample is not homogenous (see Figure 2). In such a situation, a deeper analysis should be performed to discover the reason for the bimodality. Moreover, in the case of bimodal distributions, summary statistics such as the median and mean can be deceptive, and measures such as kurtosis and standard deviation will be extremely large in comparison to unimodal distributions. Extending a bimodal distribution to further modes yields a multimodal distribution.
Different bimodal datasets were presented by many authors: Chatterjee et al. [51], Famoye et al. [52], Bansal [53]. A bimodal distribution can arise from a mixture of two different unimodal distributions, and one of the most popular bimodal distributions is the two-component normal mixture distribution [54].
The probability density function of a mixture of two-component normal distributions is as follows [55]:

$$ f(x) = p\, \phi_1(x; \mu_1, \sigma_1^2) + (1 - p)\, \phi_2(x; \mu_2, \sigma_2^2), $$

where, for $i = 1, 2$:

$$ \phi_i(x; \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right), $$

with $0 < p < 1$. The density $f$ may have more than one mode. Additionally, a mixture of two-component normal distributions is closely related to the Exponential Power Family [54]. Mixture modeling has been a favorable model-based technique in supervised and unsupervised clustering problems [56].
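As an illustration of how such a mixture produces two modes, consider the following minimal Python sketch (assuming numpy and scipy are available; all parameter values are purely illustrative and chosen to mimic an LGD-like shape with peaks near zero and one). It evaluates the mixture density and draws a bimodal sample:

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters (not taken from the paper's dataset):
p = 0.6                    # mixing proportion of the first component
mu1, sigma1 = 0.05, 0.10   # component near zero ("full recovery")
mu2, sigma2 = 0.95, 0.10   # component near one ("full loss")

def mixture_pdf(x):
    """Density of the two-component normal mixture defined above."""
    return p * norm.pdf(x, mu1, sigma1) + (1 - p) * norm.pdf(x, mu2, sigma2)

# Sampling: choose a component for each observation, then draw from it.
rng = np.random.default_rng(42)
n = 10_000
from_first = rng.random(n) < p
sample = np.where(from_first,
                  rng.normal(mu1, sigma1, n),
                  rng.normal(mu2, sigma2, n))
# A histogram of `sample` shows two clearly separated peaks,
# i.e., the bimodal shape discussed above.
```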
Moreover, bimodality has an important impact on econometric analysis. One of the crucial decisions in the assessment of econometric models concerns statistical significance. This can be assessed using one of the most popular tests, Student's t-test, which assumes that the distribution of the modeled variable is normal, implying that the variable has one mode [57]. In the case of modeled variables with two or more modes, Student's t-test may therefore be misleading.
4. Weighted Quantile Regression Forests
The Random Forest algorithm introduced by Breiman [24] works by constructing many decision trees and outputting the predictions of the individual trees, utilizing a training sample (training dataset $D$) of $n$ observations, a target variable $Y$ (where $y$ is an empirical realization of $Y$) and $p$ predictor variables $X = (X_1, \ldots, X_p)$, where each feature $X_j$ can take a value from its own set of possible values $\mathcal{X}_j$ (moreover, $x$ is an empirical realization of $X$). The main objective of each decision tree is to find a model $f(X)$ for predicting the values of $Y$ from new $X$ values [7]. In theory, the solution is simply a recursive partition of the $X$ space into disjoint rectangular subspaces of $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_p$ (eventually reaching a final node called the leaf, denoted as $\ell$) in such a way that the predicted value $\hat{y}$ of $Y$ minimizes the total impurity of its child nodes (it is usually assumed that each parent node has two children, i.e., binary trees are considered). One of the first and most widely used decision tree algorithms is the classification and regression tree (CART) [58], employing a measure of node impurity based on the distribution of the observed $y$ values in the node and splitting a node so as to minimize the total impurity of its two child nodes, defined by the total sum of squares [7]:

$$ SS = \sum_{c \in \{L, R\}} \sum_{i \in c} (y_i - \bar{y}_c)^2, $$

where $\bar{y}_c$ denotes the average value of the vector $y$ over all observations belonging to a particular (left or right child) node $c$. The process is applied recursively to the data in each child node. Splitting stops if the relative decrease in impurity is below a pre-specified threshold.
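To make the split criterion concrete, here is a minimal single-feature sketch in Python (the function name and the toy data are ours); it exhaustively searches for the threshold that minimizes the total sum of squares of the two candidate child nodes:

```python
import numpy as np

def best_split(x, y):
    """Return the split threshold on one feature that minimizes the
    total sum of squares (impurity) of the two child nodes."""
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    best_ss, best_threshold = np.inf, None
    for k in range(1, len(y_s)):                  # candidate split positions
        left, right = y_s[:k], y_s[k:]
        ss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ss < best_ss:
            best_ss, best_threshold = ss, (x_s[k - 1] + x_s[k]) / 2
    return best_threshold, best_ss

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.where(x < 0.4, 0.1, 0.9) + rng.normal(0, 0.05, 200)  # step + noise
print(best_split(x, y))  # recovered threshold should be close to 0.4
```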
The prediction of a single tree $T$ for a new data point $X = x$ is obtained by averaging the observed values in the leaf $\ell(x)$ that $x$ falls into. Let the weight vector $w_i(x)$ be given by a positive constant if observation $X_i$ is part of the leaf $\ell(x)$ and zero if it is not. The weights add up to one, and thus [8]:

$$ w_i(x) = \frac{\mathbf{1}_{\{X_i \in \ell(x)\}}}{\#\{j : X_j \in \ell(x)\}}; $$

then, the prediction can be computed as the weighted average of the target values $y_i$:

$$ \hat{y}(x) = \sum_{i=1}^{n} w_i(x)\, y_i. $$
There are various stopping criteria controlling the growth of the tree, such as the minimum number of observations that must exist in a node in order for a split to be attempted, the minimum number of observations in any terminal node (leaf), or the maximum depth of any node of the final tree [7]. To achieve good generalization ability (i.e., a small error rate on unseen examples), the tree is first grown to an overly large size and then pruned back to a smaller size that minimizes the misclassification error. CART employs k-fold cross-validation for this purpose (10-fold by default) [7].
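For orientation, these stopping criteria map directly onto hyperparameters of common implementations; a small sketch using scikit-learn (the library choice, the synthetic data and the illustrative values are ours) is shown below:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(
    min_samples_split=20,  # min observations in a node for a split to be attempted
    min_samples_leaf=5,    # min observations in any terminal node (leaf)
    max_depth=10,          # maximum depth of any node of the final tree
)
# Cross-validation (here 10-fold) can be used to tune these settings
# or the pruning strength, analogously to CART's internal procedure.
scores = cross_val_score(tree, X, y, cv=10)
print(scores.mean())
```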
The training algorithm for RF applies the general technique of bootstrap aggregating [38], also called bagging, to the base learners (decision trees). Given a training dataset $D$ of size $n$, bagging generates $B$ new training sets $D_b$, each of size $n'$, by sampling from $D$ uniformly and with replacement [7]. By sampling with replacement, some observations may be repeated in each $D_b$. If $n' = n$, then, for large $n$, the dataset $D_b$ is expected to contain the fraction $1 - 1/e \approx 63.2\%$ of the unique examples of $D$, with the rest being duplicates. The $B$ models $f_b$ ($b = 1, \ldots, B$) are fitted using the above $B$ bootstrap samples and combined by averaging [7]:

$$ \hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x). \tag{7} $$
This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that although the predictions of a single tree are highly sensitive to noise in its training dataset, the average of many trees is not, as long as the trees are not correlated.
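The $1 - 1/e \approx 63.2\%$ unique-example fraction mentioned above is easy to verify numerically; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
bootstrap = rng.integers(0, n, size=n)      # sample of size n, with replacement
print(np.unique(bootstrap).size / n)        # empirical fraction, ~0.632
print(1 - np.exp(-1.0))                     # theoretical limit 1 - 1/e
```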
The above procedure describes the original bagging algorithm for decision trees. The RF algorithm contains an additional modification: it uses a modified tree-learning algorithm, called a Random Decision Tree, that selects a random subset of $m$ of the features at each candidate split in the learning process. This process is sometimes called feature bagging [7]. The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors of the response variable (target output), these features will be selected in many of the $B$ trees, causing them to become correlated. Typically, for a regression problem with $p$ features, $m = \lfloor p/3 \rfloor$ features are used in each split [7].
Each tree within the forest is built to its maximum size, i.e., without pruning. The evaluation of the model performance on the training dataset is often replaced by evaluation on the out-of-bag (OOB) sample, i.e., the remaining observations (on average about $36.8\%$) not drawn into the bootstrap, or in-bag (INB), sample. The OOB error is the mean prediction error computed, for each observation, using only the trees that did not have that observation in their bootstrap sample [7].
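In practice, the OOB error comes almost for free from standard implementations; a minimal scikit-learn sketch (the synthetic data and settings are ours):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=9, noise=5.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,
    max_features=1 / 3,  # roughly p/3 features tried at each split
    oob_score=True,      # score each observation only with trees that never saw it
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)     # OOB R^2: an estimate of out-of-sample performance
```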
According to Equation (7), it can be shown that Regression Forests approximate the conditional mean $E(Y \mid X = x)$. It can be suspected that RF delivers not only a good approximation of the conditional mean, but that it does the same for the full conditional distribution. The conditional distribution function of $Y$, given $X = x$, is given by:

$$ F(y \mid X = x) = P(Y \le y \mid X = x) = E\left( \mathbf{1}_{\{Y \le y\}} \mid X = x \right), $$

where $\mathbf{1}_{\{Y \le y\}}$ is the indicator function, which, for the Regression Forests, gives an approximation of any parameter (e.g., quantile) of the conditional distribution:

$$ \hat{F}(y \mid X = x) = \sum_{i=1}^{n} w_i(x)\, \mathbf{1}_{\{y_i \le y\}}. $$

Based on the above equation, for a given probability $\alpha \in (0, 1)$, we can estimate the quantile $Q_\alpha(x)$ as:

$$ \hat{Q}_\alpha(x) = \inf\left\{ y : \hat{F}(y \mid X = x) \ge \alpha \right\}. $$

The quantiles give more complete information about the distribution of $Y$ as a function of the predictor features than the conditional mean alone.
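Given the weight vector $w_i(x)$ and the training targets, the quantile estimate above reduces to reading off a weighted empirical CDF; a small self-contained sketch (the function name and toy values are ours):

```python
import numpy as np

def weighted_quantile(y, w, alpha):
    """Smallest y at which the weighted ECDF reaches alpha, i.e.,
    inf{ y : F_hat(y | X = x) >= alpha } for weights w = w_i(x)."""
    order = np.argsort(y)
    y_s, w_s = y[order], w[order]
    cdf = np.cumsum(w_s)                    # F_hat evaluated at the observed y's
    idx = int(np.searchsorted(cdf, alpha))
    return y_s[min(idx, len(y_s) - 1)]

# Toy example with weights concentrated near full recovery and full loss:
y = np.array([0.0, 0.1, 0.2, 0.9, 1.0])
w = np.array([0.3, 0.2, 0.1, 0.2, 0.2])    # w_i(x), summing to one
print(weighted_quantile(y, w, 0.5))        # conditional median -> 0.1
```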
The original quantile Regression Forests implementation aggregates tree-level results equally across trees. In this article, we use the standard qRF algorithm to build the trees in the forest, but we apply performance-based weights $v_b$ in quantile estimation, so that trees with better performance carry more weight:

$$ \hat{Q}^{w}_\alpha(x) = \inf\left\{ y : \sum_{i=1}^{n} \left( \sum_{b=1}^{B} v_b\, w^{(b)}_i(x) \right) \mathbf{1}_{\{y_i \le y\}} \ge \alpha \right\}, \quad \sum_{b=1}^{B} v_b = 1, \tag{11} $$

where $w^{(b)}_i(x)$ denotes the weight vector of the $b$-th tree.
Since the weights are based on the performance of a certain tree, applying weights to the same dataset from which they were calculated would bias the prediction error assessment [7]. Because of this, the predictive performance of each tree is assessed on the observations from its OOB sample as a weighted root mean square error ($wRMSE$):

$$ wRMSE_b = \sqrt{ \frac{ \sum_{i \in OOB_b} u_i \left( y_i - \hat{y}^{(b)}_i \right)^2 }{ \sum_{i \in OOB_b} u_i } }, \tag{12} $$

where $OOB_b$ is the out-of-bag sample of the $b$-th tree and $\hat{y}^{(b)}_i$ is the prediction of the $b$-th tree for observation $i$.
It should be noted that some observations are more difficult to predict than others. Because of this, some learning algorithms (e.g., AdaBoost) incorporate observation weighting [59] and, for the same reason, Equation (12) contains a weight $u_i$ associated with each observation. Observation weighting can also be considered a fairness problem [60], which is, by definition, the absence of any prejudice or favoritism towards an individual or a group based on their intrinsic or acquired traits in the context of decision making. In qRF, some trees may show a better performance on an OOB sample simply because the difficult observations happened to be in their training (INB) sample. Such observations are difficult to predict correctly for trees that did not have them in the INB used for training [7]. Due to this, to assess the performance of each tree, we introduce a weight based on the observations in the out-of-bag sample:

$$ u_i = \frac{1}{|B_i|} \sum_{b \in B_i} \left( y_i - \hat{y}^{(b)}_i \right)^2, \tag{13} $$

where $B_i = \{ b : i \in OOB_b \}$ stands for the set of trees for which the $i$-th observation belongs to the out-of-bag sample. For example, if $u_i = 0$ (meaning that each tree correctly predicts the true value), the observation is practically omitted.
Finally, to calculate the final weights $v_b$ used in Equation (11), we incorporate the ranked weights approach, comprised of two stages [7]: first, the trees are ranked according to their $wRMSE_b$ values (Equation (12)); second, each tree is assigned a weight based on its rank. For a better understanding of this approach, let us assume that we have a list of $B$ prioritized (ranked) criteria, where each criterion $c_b$ has a rank $r_b$ ($b = 1, \ldots, B$). The goal is to select and rank a set of criteria that seem to be relevant, giving each $b$-th criterion a rank $r_b$. The rank is inversely related to the weight, which means that the first rank ($r_b = 1$) denotes the highest weight (best tree), whilst rank $r_b = B$ denotes the lowest weight (worst tree). Many authors suggest various approaches for assigning weights based on a given criterion, e.g., rank reciprocal (inverse), rank sum (linear) and rank exponent weights. In this paper, we assume that the weights should be squared [61]:

$$ v_b = \frac{(B - r_b + 1)^2}{\sum_{j=1}^{B} (B - r_j + 1)^2}, \tag{14} $$

where $r_b$ is the rank of the $b$-th tree. The complete weighted quantile Regression Forest pseudocode is summarized in Algorithm 1 (below).
Algorithm 1. Weighted quantile Regression Forest algorithm pseudocode.
input: number of trees ($B$), size of the random subset of features ($m$), training dataset ($D$), probability for quantile estimation ($\alpha$)
output: weighted quantile Regression Forest ($wqRF$)
1: $wqRF$ is empty
2: for each $b = 1$ to $B$ do
3:   $D_b$ = BootstrapSample($D$)
4:   $T_b$ = RandomDecisionTree($D_b$, $m$)
5:   $wqRF$ = $wqRF \cup \{T_b\}$
6: end
7: for each $i = 1$ to $n$ do
8:   Compute $u_i$ using Formula (13)
9: end
10: for each $b = 1$ to $B$ do
11:   Compute $wRMSE_b$ using Formula (12)
12: end
13: for each $b = 1$ to $B$ do
14:   Compute $v_b$ using Formula (14)
15: end
16: Compute the final prediction using Formula (11)
17: return $wqRF$
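To make the procedure concrete, the following compact Python sketch implements Algorithm 1 end-to-end under the formulas reconstructed above. The helper names (`fit_wqrf`, `predict_quantile`) are ours, the data is synthetic, and the leaf-weight bookkeeping is simplified (all training observations, not only the in-bag ones, are routed through each tree), so this illustrates the idea rather than reproducing the paper's exact implementation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def fit_wqrf(X, y, B=100, seed=0):
    """Algorithm 1, steps 1-15: fit B trees, then the tree weights v_b."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, inb_masks = [], []
    for b in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample D_b
        tree = DecisionTreeRegressor(max_features=1 / 3, random_state=b)
        tree.fit(X[idx], y[idx])
        mask = np.zeros(n, dtype=bool)
        mask[idx] = True                              # INB indicator
        trees.append(tree)
        inb_masks.append(mask)

    preds = np.array([t.predict(X) for t in trees])   # B x n predictions

    # Observation weights u_i (Formula (13)): mean squared OOB error.
    err, cnt = np.zeros(n), np.zeros(n)
    for b, mask in enumerate(inb_masks):
        oob = ~mask
        err[oob] += (y[oob] - preds[b, oob]) ** 2
        cnt[oob] += 1
    u = np.where(cnt > 0, err / np.maximum(cnt, 1), 0.0)

    # Tree-level weighted RMSE (Formula (12)) on each OOB sample.
    wrmse = np.empty(B)
    for b, mask in enumerate(inb_masks):
        oob = ~mask
        w_sum = max(u[oob].sum(), 1e-12)
        wrmse[b] = np.sqrt((u[oob] * (y[oob] - preds[b, oob]) ** 2).sum() / w_sum)

    # Squared rank weights v_b (Formula (14)); rank 1 = lowest wRMSE.
    ranks = wrmse.argsort().argsort() + 1
    v = (B - ranks + 1.0) ** 2
    return trees, v / v.sum()

def predict_quantile(trees, v, X_train, y_train, x_new, alpha):
    """Step 16 (Formula (11)): weighted quantile for a new point."""
    n = len(y_train)
    w = np.zeros(n)
    for tree, v_b in zip(trees, v):
        leaves = tree.apply(X_train)
        leaf_new = tree.apply(x_new.reshape(1, -1))[0]
        in_leaf = leaves == leaf_new
        if in_leaf.any():
            w[in_leaf] += v_b / in_leaf.sum()         # v_b * w_i^(b)(x)
    order = np.argsort(y_train)
    cdf = np.cumsum(w[order])
    idx = int(np.searchsorted(cdf, alpha))
    return y_train[order][min(idx, n - 1)]

X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=1)
trees, v = fit_wqrf(X, y, B=50)
print(predict_quantile(trees, v, X, y, X[0], alpha=0.5))
```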
6. Conclusions
Loss given default is an important component of the Basel Accords. Using more advanced methods than those currently employed gives banks the possibility to calculate their regulatory capital in a more accurate way and may lead to a healthier and less risky allocation of capital. The findings from this research can help financial institutions when estimating LGD under the Internal Ratings Based Approach of the Basel Accords, in order to estimate the downturn LGD needed to calculate the capital requirements. Most analytical parametric and non-parametric methods (such as ordinary least squares, decision trees, artificial neural networks, etc.) used for LGD modeling consider only the prediction of the mean and do not consider the entire LGD distribution. This causes a serious problem, since the empirical LGD distribution is bimodal, with peaks at the opposite ends of the range, i.e., full loss (a value of one) and full recovery (a value of zero). To overcome this, this paper presented the weighted quantile Regression Forests model, built based on the data from one of the biggest banks in Poland. With our analysis, we confirm that predicting the future values of the loss given default is feasible and can be achieved with reasonable accuracy compared with the base models (first research question). This statement is supported by the results presented in Table 4, Figure 5 and Figure 6. Our answer to the second research question is positive as well: the results proved that the weighted quantile Regression Forests are able to predict future LGD values with better accuracy than the benchmark standard quantile Regression Forests. Finally, we have also empirically confirmed that both wqRF and qRF outperform the other benchmark methods, including the quantile regression and ordinary least squares models (third research question). The contributions of this article can be summarized as follows:
Systematization of the knowledge regarding bimodal distributions, loss given default modeling and attempts to improve the Regression Forest algorithm;
Application of various modeling methods to the loss given default problem;
Incorporation of a weighting procedure into quantile Regression Forests.
Future work will be conducted along two paths, i.e., with a focus on the business perspective and on modeling methodology. On the first path, a similar analysis should be performed using other banks' products. Our comparison was performed using only one product, the overdraft, so an interesting challenge would be to perform a similar analysis on another bank product. The incorporation of other macroeconomic variables should also be thoroughly analyzed to provide further insights into the analyzed models. On the second path, future work should include extending this weighted ensemble framework to other modeling problems, including simple regression and multiclass classification. Moreover, we will try to apply a similar concept to other ensemble creation methods. Lastly, we intend to further investigate the performance and default settings of the parameters in the context of the deviation and variance of the base models, with (potentially) both theoretical and empirical analyses.