We use a weighted logistic regression to model the LGD for defaulted accounts, using all available data. The actual loss experienced is transformed into a binary format, related to the “fuzzy augmentation” technique commonly used to introduce “rejects” in scorecard development (Siddiqi 2006). This means that each observation receives both a target of 1 (Y = 1) and a target of 0 (Y = 0). Furthermore, a weight variable is created such that the sum of the weights of these two events adds up to the full exposure of the account at observation. This is related to Van Berkel and Siddiqi (2012), who used a scorecard format for modelling LGD. This newly proposed methodology for LGD2 only considers worked-out accounts. A worked-out account can either cure or be written off. Note that the point of write-off is taken as the specific point at which the institution (e.g., a bank) no longer expects any recovery. This is specifically prescribed by the reporting standard: “IFRS 7 (35F) (e): The Group writes off financial assets, in whole or in part, when it has exhausted all practical recovery efforts and has concluded there is no reasonable expectation of recovery. Indicators that there is no reasonable expectation of recovery include (i) ceasing enforcement activity and (ii) where the Group’s effort to dispose of repossessed collateral is such that there is no reasonable expectation of recovering in full” (PWC 2017). In effect, with our methodology, all write-offs and cures are included regardless of the time spent in default, and no filter is applied to the default cohort.
We calculated the LGD for accounts that cured and for accounts that were written off. The modelling approach can be subdivided into five steps: (1) sample creation; (2) creation of target and weight variables; (3) input variable selection; (4) weighted logistic regression; and (5) a test for independence.
Note that if any accounts in the considered dataset originated as credit-impaired accounts (i.e., accounts starting in default), their loss behaviour will most likely differ from that of other Stage 3 accounts, and they should therefore be modelled separately (i.e., the portfolio should be segmented on this characteristic). In the specific case study presented here, no such accounts existed.
2.2.2. Step 2: Target and Weight Variables Created
Two rows ($Y_i = 1$ and $Y_i = 0$) are created for each observation $i$ (i.e., per account per month). Each row is weighted. Cured and written-off accounts are weighted differently. Mathematically, the weight for observation $i$ is defined as

$$w_i = \begin{cases} LGD_i \times E_i, & \text{if } Y_i = 1 \\ (1 - LGD_i) \times E_i, & \text{if } Y_i = 0 \end{cases} \quad (1)$$

where the loss given default of observation $i$ ($LGD_i$) is defined as

$$LGD_i = \begin{cases} P_R \times L_R, & \text{if observation } i \text{ cured} \\ WO_i / E_i, & \text{if observation } i \text{ was written off} \end{cases} \quad (2)$$

where $n$ is the number of observations, $i = 1, \ldots, n$; $E_i$ is the exposure of observation $i$; and therefore,

$$E_i = w_i^{(Y=1)} + w_i^{(Y=0)} \quad (3)$$

where $w_i^{(Y=1)} = LGD_i \times E_i$ and $w_i^{(Y=0)} = (1 - LGD_i) \times E_i$. $P_C$ is the proportion of cured observations over the total number of worked-out accounts (over the reference period); $P_R$ is the proportion of observations that re-default over the reference period; $L_R$ is the exposure at default (EAD) minus the net present value (NPV) of recoveries from the first point of default, for all observations in the reference period, divided by the EAD (see, e.g., PWC (2017) and Volarević and Varović (2018)); $WO_i$ is the discounted write-off amount for observation $i$; and $P_C$, $P_R$ and $L_R$ are therefore empirically calculated values. These should be regularly updated to ensure the final LGD estimate remains a point-in-time estimate as required by IFRS (IFRS 2014).
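To make the construction of the target and weight variables concrete, the following is a minimal Python sketch of Equations (1)–(3). The column names (`exposure`, `outcome`, `discounted_write_off`) and the example values of $P_R$ and $L_R$ are illustrative assumptions, not values from the case study.

```python
import pandas as pd

# Illustrative, assumed values for the empirically calculated quantities.
P_R = 0.30  # proportion of observations that re-default (assumed)
L_R = 0.45  # loss rate on re-defaulted observations (assumed)

def lgd_per_observation(row):
    # Equation (2): P_R * L_R for cured accounts; WO_i / E_i for write-offs.
    if row["outcome"] == "cured":
        return P_R * L_R
    return row["discounted_write_off"] / row["exposure"]

def fuzzy_augment(df):
    # Equation (1): duplicate each observation into a Y = 1 row weighted by
    # LGD_i * E_i and a Y = 0 row weighted by (1 - LGD_i) * E_i, so that
    # the two weights sum to the exposure E_i (Equation (3)).
    lgd = df.apply(lgd_per_observation, axis=1)
    row_y1 = df.assign(Y=1, weight=lgd * df["exposure"])
    row_y0 = df.assign(Y=0, weight=(1 - lgd) * df["exposure"])
    return pd.concat([row_y1, row_y0], ignore_index=True)
```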
Note that the write-off amount is used in Equation (2) to calculate the actual LGD. An alternative method employs the recovery cash flows over the workout period. A bank is required to use its “best estimate” (a regulatory term; see, e.g., Basel Committee on Banking Supervision (2019b) and European Central Bank (2018)) to determine the actual LGD. In our case, this decision was based on the data available: only the write-off amount was available for our case study, not the recovered cash flows. In Equation (2), the write-off amount needs to be discounted using the effective interest rate (PWC 2017) to incorporate the time value of money. When recoveries are used, each recovery cash flow needs to be similarly discounted. In the case study, the length of the recovery time period exists in the data and differs for each account. The length of this recovery period influences the calculated LGD: the longer the recovery process, the greater the effect of discounting. In the case study, we used the client interest rate as the effective interest rate when discounting.
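As an illustration, the discounting step could be sketched as follows; the monthly compounding convention and the function name are assumptions for this example, not prescriptions from the case study.

```python
# Discount the write-off amount back to the point of default, using the
# client interest rate as the effective interest rate (monthly compounding
# is an assumed convention for illustration).
def discount_write_off(write_off_amount, annual_rate, months_in_recovery):
    monthly_rate = (1 + annual_rate) ** (1 / 12) - 1
    return write_off_amount / (1 + monthly_rate) ** months_in_recovery

# e.g., a 10,000 write-off recognised 18 months after default at a 12% rate:
wo_discounted = discount_write_off(10_000, 0.12, 18)  # ~8,437
```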
Note that, in special circumstances, accounts may be partially written off, which can lead to an overestimation of the provision. This should be taken into account during the modelling process. However, in our case study, no such accounts existed.
Illustrative Example
Consider one observation with an exposure of \$50,000. Assume it is a written-off account, for a specific month, with an $LGD_i$ = 27% (based on the written-off amount divided by the exposure, i.e., $WO_i / E_i$). The weight variable for $Y_i = 1$ will be $0.27 \times \$50{,}000 = \$13{,}500$, and for $Y_i = 0$ it will be $(1 - 0.27) \times \$50{,}000 = \$36{,}500$ (see Table 1).
2.2.3. Step 3: Input Variables (i.e., Variable Selection)
All input variables were first screened according to three requirements: the percentage of missing values, the Gini statistic and business input. If too many values of a specific variable were missing, that variable was excluded. Similarly, if a variable had too low a value for the Gini statistic, that variable was also excluded. Note that business analysts should investigate whether there are any data issues with variables that have low Gini statistics; for example, traditionally strong variables may appear weak if the data has significant sample bias. This forms part of the data preparation that is always essential before predictive modelling takes place.
The Gini statistic (Siddiqi 2006) quantifies a model’s ability to discriminate between two possible values of a binary target variable (Tevet 2013). Cases are ranked according to the predictions, and the Gini then provides a measure of the correctness of that ranking. It is one of the most popular measures used in retail credit scoring (Baesens et al. 2016; Siddiqi 2006; Anderson 2007) and has the added advantage that it is a single value (Tevet 2013).
Sort the data by descending order of the proportion of events in each attribute. Suppose a characteristic has $L$ attributes. Then, the sorted attributes are placed in groups $j = 1, 2, \ldots, L$. Each group corresponds to an attribute. For each of these sorted groups, compute the number of events $\#(Y=1)_j$ and the number of nonevents $\#(Y=0)_j$ in group $j$. Then compute the Gini statistic:

$$\text{Gini} = \left(1 - \frac{\sum_{j=1}^{L}\left(\#(Y=1)_j + 2\sum_{k=j+1}^{L}\#(Y=1)_k\right) \times \#(Y=0)_j}{N_1 \times N_0}\right) \times 100 \quad (4)$$

where $N_1$ and $N_0$ are the total number of events and nonevents in the data, respectively.
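A minimal Python sketch of this computation follows, assuming the per-group counts $\#(Y=1)_j$ and $\#(Y=0)_j$ are already available (e.g., from a crosstab of the binned characteristic against the target).

```python
import numpy as np

def gini_statistic(events, nonevents):
    # events / nonevents: per-attribute counts #(Y=1)_j and #(Y=0)_j.
    events = np.asarray(events, dtype=float)
    nonevents = np.asarray(nonevents, dtype=float)
    # Sort groups by descending proportion of events, as described above.
    order = np.argsort(-(events / (events + nonevents)))
    n1, n0 = events[order], nonevents[order]
    N1, N0 = n1.sum(), n0.sum()
    # For each group j, the events in all later groups: sum_{k=j+1}^{L} n1_k.
    tail_events = n1[::-1].cumsum()[::-1] - n1
    return (1 - ((n1 + 2 * tail_events) * n0).sum() / (N1 * N0)) * 100
```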
Only variables with a sufficient Gini value and which were considered important from a business perspective were included in the modelling process. All the remaining variables after the initial screening were then binned. The concept of binning is known by different names, such as discretisation, classing, categorisation, grouping and quantification (Verster 2018). For simplicity, we use the term binning throughout this paper. Binning is the mapping of continuous or categorical data into discrete bins (Nguyen et al. 2014). It is a frequently used pre-processing step in predictive modelling and considered a basic data preparation step in building a credit scorecard (Thomas 2009). Credit scorecards are convenient points-based models that predict binary events and are broadly used due to their simplicity and ease of use; see, e.g., Thomas (2009) and Siddiqi (2006). Among the practical advantages of binning are the removal of the effects of outliers and a convenient way to handle missing values (Anderson 2007). The binning was done iteratively by first generating equal-width bins, followed by adjustments based on business input to obtain the final set, as sketched below. Note that if binned variables are used in logistic regression, the final model can easily be transformed into a scorecard.
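As a rough sketch of this step, equal-width starting bins could be generated as follows; the bin count, the use of pandas, and the explicit-edge adjustment are illustrative assumptions.

```python
import pandas as pd

def initial_equal_width_bins(series, n_bins=10):
    # Equal-width starting bins; missing values are kept as an explicit
    # "Missing" bin so those rows remain usable in the model fit.
    binned = pd.cut(series, bins=n_bins)
    return binned.cat.add_categories(["Missing"]).fillna("Missing")

# Business input would then adjust these bins, e.g., by supplying explicit
# edges: pd.cut(series, bins=[0, 10_000, 50_000, 250_000])
```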
All bins were quantified by means of the average LGD value per bin. The motivation behind this was to propose an alternative to using dummy variables. Logistic regression cannot use categorical variables coded in their original format (Neter et al. 1996). As such, some other measure is needed for each bin to make it usable; the default technique in logistic regression is a dummy variable for each class less one. However, expanding categorical inputs into dummy variables can greatly increase the dimension of the input space (SAS Institute 2010). One alternative is to quantify each bin using the target value (in our case the LGD value), for example using weights of evidence (WOE; see Siddiqi (2006)), which reduces the number of estimates. An example of this is using the natural logarithm (ln) of the good/bad odds (i.e., the WOE); see, for example, Lund and Raimi (2012). We used the standardised average LGD value in each bin, as illustrated below.
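A minimal sketch of this quantification, assuming a DataFrame with the binned characteristic, the LGD value and the Equation (1) weight per row (the column names are illustrative; the standardisation itself is shown later in this step):

```python
import numpy as np
import pandas as pd

def quantify_bins(df, bin_col):
    # Replace each bin with its weighted average LGD, as an alternative to
    # dummy variables (a WOE value per bin could be computed analogously).
    avg_lgd = df.groupby(bin_col, observed=True).apply(
        lambda g: np.average(g["lgd"], weights=g["weight"])
    )
    return df[bin_col].map(avg_lgd)
```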
Some of the advantages of binning and quantifying the bins are as follows:
The average LGD value can be calculated for missing values, which allows “Missing” to be used in the model fit (otherwise these rows would not have been used in modelling). Note that not all missing values are equal, and there are cases where they need to be treated separately based on the reason for being missing, e.g., “no hit” at the bureau vs. no trades present. It is therefore essential that business analysts investigate the reason for missing values and treat them appropriately. This again forms part of the data preparation that is always a key prerequisite to predictive modelling.
Sparse outliers will not have an effect on the fit of the model: these outliers are incorporated into the nearest bin, and their contribution is diminished through the use of the bin’s WOE or average LGD.
Binning can capture some of the generalisation required in predictive modelling. Generalisation refers to the ability to predict the target of new cases, and binning improves the balance between being too vague and too specific.
The binning can capture possible non-linear trends (as long as they can be logically explained).
Using the standardised average LGD value for each bin ensures that all variables are of the same scale (i.e., average LGD value).
Using the average LGD value ensures that all types of variables (categorical, numerical, nominal, ordinal) will be transformed into the same measurement type.
Quantifying the bins (rather than using dummy variables) results in each variable being seen as one group (and not each level as a different variable). This aids in reducing the number of parameter estimates.
Next, each of these average LGD values was standardised by means of the weight variable. An alternative approach could have been to calculate the WOE for each bin. The WOE is regularly used in credit scorecard development (Siddiqi 2006) and is calculated using only the number of 1s and the number of 0s in each bin. Note, however, that our underlying variable of interest (LGD) is continuous. Since our modelled target variable was dichotomous, we wanted the quantification of the bin to reflect our underlying true target, i.e., the LGD value, which ranges from 0 to 1. The weighted mean LGD, $\overline{LGD}_w$, is defined as

$$\overline{LGD}_w = \frac{\sum_{i=1}^{n} w_i \times LGD_i}{\sum_{i=1}^{n} w_i} \quad (5)$$

where $LGD_i$ is the LGD value of observation $i$ and $w_i$ is the weight of observation $i$. The weighted standard deviation of LGD is defined as

$$s_w = \sqrt{\frac{\sum_{i=1}^{n} w_i \left(LGD_i - \overline{LGD}_w\right)^2}{\frac{(n-1)}{n}\sum_{i=1}^{n} w_i}} \quad (6)$$

where $n$ is the number of observations. The weighted standardised value for LGD, $LGD_i^{(std)}$, for observation $i$ will then be

$$LGD_i^{(std)} = \frac{LGD_i - \overline{LGD}_w}{s_w}. \quad (7)$$
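The standardisation in Equations (5)–(7) can be sketched as follows (NumPy assumed; `values` holds the per-bin average LGDs assigned to each observation and `weights` the weights from Equation (1)):

```python
import numpy as np

def weighted_standardise(values, weights):
    # Weighted mean and weighted standard deviation as defined above.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = len(values)
    mean_w = np.sum(weights * values) / np.sum(weights)
    std_w = np.sqrt(np.sum(weights * (values - mean_w) ** 2)
                    / ((n - 1) / n * np.sum(weights)))
    return (values - mean_w) / std_w
```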
The standardisation of all input variables implies that the estimates from the logistic regression will be standardised estimates. The benefit is that the absolute values of the standardised estimates provide an approximate ranking of the relative importance of the input variables in the fitted logistic model (SAS Institute 2010). Had this not been done, the scale of each variable could also have influenced the estimates. Note that the fitted logistic regression was a weighted logistic regression with the exposure as weight (split between $Y_i = 1$ and $Y_i = 0$); therefore, to ensure consistency, we also weighted the LGD with the same weight variable as used in the logistic regression.
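For illustration, a weighted logistic regression of this form could be fitted along the following lines. This is a sketch using statsmodels; the DataFrame `df` of augmented rows, the `weight` column and the list `input_cols` of standardised inputs are assumed from the earlier steps, and `freq_weights` is one way to pass the weights (`var_weights` is an alternative, depending on the weighting interpretation).

```python
import statsmodels.api as sm

# Weighted logistic regression on the augmented (Y = 1 / Y = 0) rows, with
# the exposure-based weights from Equation (1).
X = sm.add_constant(df[input_cols])
model = sm.GLM(df["Y"], X, family=sm.families.Binomial(),
               freq_weights=df["weight"])
result = model.fit()

# Absolute standardised estimates give an approximate importance ranking.
print(result.params.abs().sort_values(ascending=False))
```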
Furthermore, regarding months since default as an input variable: the developed model does not require the length of default for incomplete accounts in order to estimate LGD. It assumes that the length of default for these accounts will be comparable to that of similar accounts that have been resolved. This assumption can easily be monitored after implementation.