Article

Bootstrap Bandwidth Selection and Confidence Regions for Double Smoothed Default Probability Estimation

Research Group MODES, Department of Mathematics, CITIC, University of A Coruña, 15071 A Coruña, Spain
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(9), 1523; https://doi.org/10.3390/math10091523
Submission received: 1 April 2022 / Revised: 20 April 2022 / Accepted: 26 April 2022 / Published: 2 May 2022
(This article belongs to the Special Issue Application of Survival Analysis in Economics, Finance and Insurance)

Abstract

For a fixed time, t, and a horizon time, b, the probability of default (PD) measures the probability that an obligor who has paid his/her credit until time t runs into arrears no later than time t + b. This probability is one of the most crucial elements influencing credit risk. Previous works have proposed nonparametric estimators for the probability of default derived from Beran’s estimator and a doubly smoothed Beran’s estimator of the conditional survival function for censored data. They have also found asymptotic expressions for the bias and variance of the estimators, but they do not provide any practical way to choose the smoothing parameters involved. In this paper, resampling methods based on bootstrap techniques are proposed to approximate the bandwidths on which Beran’s and the smoothed Beran’s estimators of the PD depend. Bootstrap algorithms for the calculation of confidence regions for the probability of default are also proposed. Extensive simulation studies show the good behavior of the presented algorithms. The bandwidth selector and the confidence region algorithm are applied to a German credit dataset to analyze the probability of default conditional on the credit scoring.
MSC:
62F40; 62N02; 62G05; 91G40

1. Introduction

The debts coming from clients with unpaid credits have an important impact on the solvency of banks and other credit institutions. According to the Basel Committee on Banking Supervision of the Bank for International Settlements, one of the crucial elements for the risk measurement of capital assignments is the probability of default. For a fixed time, $t$, and a horizon time, $b$, the probability of default (PD) can be defined as the probability that a credit that has been paid until time $t$ becomes unpaid no later than time $t + b$. The PD is allowed to depend on the credit scoring $x$, which is usually some linear combination of informative covariates of the credit and the clients. Standard methods to estimate the PD include logistic models and other binary response parametric regression models. See the studies of [1,2,3,4,5,6], among others.
In recent years, survival analysis has started to be considered an interesting tool in credit risk problems. Since the work by [7], some literature has been developed following this line. See the works of [8,9,10,11,12,13,14]. Nonparametric estimators of the probability of default based on conditional survival function estimators are presented in [12,13] since the probability of default can be written in terms of the conditional survival function in a right censored context. All these nonparametric estimators are based on covariate smoothing. In the recent work [14], a general nonparametric estimator of the probability of default with double smoothing, both in the covariate and in the time variable, was proposed and studied.
In particular, Beran’s estimator and a doubly smoothed Beran’s estimator of the PD were presented in [13,14], respectively, and their asymptotic properties were analyzed. Simulation studies carried out in these papers show a good performance of the estimators, especially of the doubly smoothed Beran’s estimator. However, these previous studies were carried out using theoretical smoothing parameters. Since the integrated mean square error expressions are complex and depend on several population parameters, they are not useful in practice to obtain plug-in estimations of these theoretical bandwidths. The goal of this work is to propose resampling techniques to approximate them.
Bootstrap has become a strong tool in many statistical applications since it was first introduced by [15]. Bootstrap for right censored data was first proposed by [16] and the bootstrap method and its applications were studied in [17]. Asymptotic theory of bootstrap for right censored data was established by [18,19]. In [20], bootstrap for nonparametric regression with right censored observations at fixed covariate values was studied. A bootstrap approach for the nonparametric censored regression setup was studied in [21]. In [22], a local cross-validation bandwidth selector was proposed.
Our approach follows the ideas of [21], and it is based on the obvious bootstrap. Both Beran’s and the smoothed Beran’s estimators are bootstrapped in order to approximate their corresponding optimal bandwidths. The bootstrap is also useful for computing confidence regions, as the existing theoretical results only yield pointwise theoretical confidence intervals, which are not computable in practice since the variance of the estimator again depends on unknown population quantities. A bootstrap algorithm to compute confidence regions for the probability of default is also proposed.
The remainder of this paper is organized as follows. In Section 2, bootstrap selectors for the bandwidths of Beran and smoothed Beran’s estimators are proposed. In Section 3, a simulation study shows the behavior of the PD estimators with bootstrap bandwidths. The issue of obtaining confidence regions for the probability of default, $PD(t|x)$, for a fixed value of $x \in I \subseteq \mathbb{R}$ and $t$ covering the interval $I_T \subseteq \mathbb{R}^+$, is addressed in Section 4 using Beran and smoothed Beran’s estimators. A simulation study on the proposed methods for computing confidence regions is shown in Section 5. In Section 6, Beran’s estimator and the smoothed Beran’s estimator with bootstrap bandwidths are used to estimate the probability of default function conditional on the credit scoring for a German credit dataset. Section 7 contains some concluding remarks.

2. Bandwidth Selection for Beran and Smoothed Beran’s PD Estimators

Consider the right censored simple random sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$ of $(X, Z, \delta)$, where $X_i$ represents the covariate, $Z_i = \min\{T_i, C_i\}$ the observed lifetime and $\delta_i = I\{T_i \le C_i\}$ the censoring indicator, and where $T_i \ge 0$ and $C_i \ge 0$ are the time to occurrence of the event and the censoring time for the $i$-th individual of the sample, with $i = 1, \dots, n$. In credit risk, usually, $X$ is the credit scoring, $Z$ is the observed maturity, $T$ is the time to default, and $C$ is the time until the end of the study or the anticipated cancellation of the credit. The distribution function of $T$ is denoted by $F(t)$ and the survival function by $S(t) = 1 - F(t)$. The functions $F(t|x)$ and $S(t|x)$ are the conditional distribution and survival functions of $T$ evaluated at $t$ given $X = x$. The conditional distribution function of $Z$ is denoted by $H(t|x)$, and the conditional distribution function of $C$ is denoted by $G(t|x)$. It is assumed that an unknown relationship between $T$ and $X$ exists. Given $X = x$, $T$ and $C$ are assumed to be conditionally independent.
Let $x \in I$ be a fixed value of the covariate $X$ and $b > 0$ a fixed positive time. Then, the probability of default within a time horizon $t + b$ from a maturity time, $t$, is defined as follows:
$$PD(t|x) = P(T \le t + b \mid T > t, X = x) = \frac{F(t+b|x) - F(t|x)}{1 - F(t|x)} = 1 - \frac{S(t+b|x)}{S(t|x)}. \tag{1}$$
In this section, methods for the automatic selection of the bandwidths for Beran’s estimator and the smoothed Beran’s estimator of the probability of default are proposed.
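As a small numerical illustration of definition (1), the following sketch evaluates the PD from any conditional survival function with the covariate already fixed. The function name `pd_from_survival` and the exponential example are illustrative assumptions, not part of the original study.

```python
import math

def pd_from_survival(surv, t, b):
    """PD(t|x) = 1 - S(t+b|x) / S(t|x), for a survival function with x held fixed."""
    s_t = surv(t)
    if s_t <= 0.0:
        raise ValueError("S(t|x) must be positive for the PD to be defined")
    return 1.0 - surv(t + b) / s_t

# Hypothetical example: exponential lifetime with rate 0.5. Because the
# exponential law is memoryless, the PD depends only on the horizon b.
rate = 0.5
surv_exp = lambda t: math.exp(-rate * t)
pd_value = pd_from_survival(surv_exp, t=2.0, b=1.0)
```

In this toy case the result equals 1 - exp(-0.5 b) for every t, which is a convenient sanity check for any implementation of (1).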

2.1. Beran’s Estimator

Beran’s estimator of the conditional survival function is given by
$$\hat{S}_h(t|x) = \prod_{i=1}^n \left(1 - \frac{I\{Z_i \le t, \delta_i = 1\}\, w_{h,i}(x)}{1 - \sum_{j=1}^n I\{Z_j < Z_i\}\, w_{h,j}(x)}\right) \tag{2}$$
where $w_{h,i}(x) = \dfrac{K\left((x - X_i)/h\right)}{\sum_{j=1}^n K\left((x - X_j)/h\right)}$ with $i = 1, \dots, n$, $K(u)$ is a kernel function and $h = h_n$ is the bandwidth that determines the smoothness introduced in the estimator through the covariate $X$.
Plugging (2) into (1), Beran’s estimator of the probability of default is obtained:
$$\widehat{PD}_h(t|x) = 1 - \frac{\hat{S}_h(t+b|x)}{\hat{S}_h(t|x)}. \tag{3}$$
In this section, bootstrap methods are proposed for the automatic selection of $h$. There are two classic methods for bootstrap resampling in a censoring context: the obvious bootstrap and the simple bootstrap. The equivalence between both methods in an unconditional setup is proved in [16]. In [21], this result is extended to the case where a covariate is involved, assuming there are no ties in the sample values of the covariate. This was done by proving the equivalence of the two resampling methods, the obvious bootstrap and the simple weighted bootstrap. In this paper, the following obvious bootstrap method combined with a smoothed bootstrap for the covariate is proposed.
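Before turning to the resampling scheme, the product formula (2) and the plug-in PD estimator can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel and no ties in the observed lifetimes; the function names (`nw_weights`, `beran_survival`, `beran_pd`) and the toy data are hypothetical and do not come from the authors' R software.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nw_weights(x, xs, h):
    """Kernel weights w_{h,i}(x); assumes at least one kernel value is positive."""
    k = [gaussian_kernel((x - xi) / h) for xi in xs]
    total = sum(k)
    return [ki / total for ki in k]

def beran_survival(t, x, xs, zs, deltas, h):
    """Beran's estimator of S(t|x), written directly from the product formula."""
    w = nw_weights(x, xs, h)
    surv = 1.0
    for i, (zi, di) in enumerate(zip(zs, deltas)):
        if zi <= t and di == 1:
            # 1 - sum of the weights of observations with Z_j < Z_i
            at_risk = 1.0 - sum(wj for zj, wj in zip(zs, w) if zj < zi)
            if at_risk <= 0.0:
                return 0.0
            surv *= max(0.0, 1.0 - w[i] / at_risk)
    return surv

def beran_pd(t, b, x, xs, zs, deltas, h):
    """Plug-in PD estimator: 1 - S_h(t+b|x) / S_h(t|x)."""
    s_t = beran_survival(t, x, xs, zs, deltas, h)
    if s_t <= 0.0:
        raise ValueError("estimated survival at t is zero")
    return 1.0 - beran_survival(t + b, x, xs, zs, deltas, h) / s_t

# Toy data (hypothetical): five uncensored credits with distinct lifetimes.
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
zs = [1.0, 2.0, 3.0, 4.0, 5.0]
deltas = [1, 1, 1, 1, 1]
# With a huge bandwidth all weights are nearly 1/n, so the estimator
# reduces to the empirical survival function.
s_hat = beran_survival(2.5, 0.5, xs, zs, deltas, h=1e6)
pd_hat = beran_pd(2.5, 1.0, 0.5, xs, zs, deltas, h=1e6)
```

With equal weights and no censoring, two of the five lifetimes lie below 2.5 and three below 3.5, so the survival estimates are close to 3/5 and 2/5 and the PD estimate is close to 1/3.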

2.1.1. Algorithm for Bootstrap Resampling

Let $I_1 \subseteq \mathbb{R}$ be an interval containing appropriate bandwidth values and let $r \in I_1$ be the pilot bandwidth for the bootstrap resampling:
  • Obtain $U_1, \dots, U_n$ iid with $U_i \sim U(0,1)$ and $V_1, \dots, V_n$ iid with common density $K$, for $i = 1, \dots, n$.
  • For each $i = 1, \dots, n$, define
    $$X_i^* = X_{[nU_i]+1} + r V_i,$$
    where $[u]$ is the integer part of $u$. Generate $T_i^*$ from Beran’s estimator of the conditional distribution of $T$ using the sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$ and bandwidth $r$, denoted by $\hat{F}_r(t|X_i^*)$, and $C_i^*$ from Beran’s estimator of the conditional distribution of $C$ using the sample $\{(X_i, Z_i, 1 - \delta_i)\}_{i=1}^n$ and bandwidth $r$, denoted by $\hat{G}_r(t|X_i^*)$.
    The estimators $\hat{F}_r(t|X_i^*)$ and $\hat{G}_r(t|X_i^*)$ are forced to be equal to one from the last observed lifetime ($\max\{Z_i : i = 1, \dots, n\}$) onwards.
  • For each $i = 1, \dots, n$, obtain
    $$Z_i^* = \min\{T_i^*, C_i^*\},$$
    $$\delta_i^* = I\{T_i^* \le C_i^*\}.$$
  • Consider the bootstrap resample $\{(X_i^*, Z_i^*, \delta_i^*)\}_{i=1}^n$.
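The resampling steps above can be sketched as follows. This is an illustrative reading of the algorithm, assuming a Gaussian kernel $K$ (so that $V_i$ is standard normal) and no ties among the observed lifetimes; the helper names `beran_masses` and `bootstrap_resample` and the synthetic data are hypothetical.

```python
import math
import random

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def beran_masses(x, xs, zs, indicators, h):
    """Jump sizes of Beran's estimator of a conditional cdf at covariate value x.

    Returns (sorted_z, masses). Mass left after the last jump is placed on
    max(Z), so the estimated cdf equals one from there on. Assumes no ties in zs.
    """
    k = [gaussian_kernel((x - xi) / h) for xi in xs]
    total = sum(k)
    order = sorted(range(len(zs)), key=lambda i: zs[i])
    surv, cum_w, masses = 1.0, 0.0, []
    for i in order:
        w = k[i] / total
        at_risk = 1.0 - cum_w          # 1 - sum of weights with Z_j < Z_i
        factor = 1.0
        if indicators[i] == 1:
            factor = max(0.0, 1.0 - w / at_risk) if at_risk > 0.0 else 0.0
        masses.append(surv * (1.0 - factor))
        surv *= factor
        cum_w += w
    masses[-1] += surv                 # force the cdf to one at the last lifetime
    return [zs[i] for i in order], masses

def bootstrap_resample(xs, zs, deltas, r, rng):
    """One obvious-bootstrap resample with a smoothed covariate."""
    n = len(xs)
    censor_ind = [1 - d for d in deltas]
    resample = []
    for _ in range(n):
        # randrange(n) plays the role of [n U_i] + 1 in zero-based indexing
        x_star = xs[rng.randrange(n)] + r * rng.gauss(0.0, 1.0)
        z_t, m_t = beran_masses(x_star, xs, zs, deltas, r)
        z_c, m_c = beran_masses(x_star, xs, zs, censor_ind, r)
        t_star = rng.choices(z_t, weights=m_t)[0]
        c_star = rng.choices(z_c, weights=m_c)[0]
        resample.append((x_star, min(t_star, c_star), int(t_star <= c_star)))
    return resample

# Synthetic sample (hypothetical), only to exercise the sketch.
rng = random.Random(0)
n = 30
xs = [rng.random() for _ in range(n)]
ts = [rng.expovariate(1.0 + 5.0 * x) for x in xs]
cs = [rng.expovariate(2.0) for _ in range(n)]
zs = [min(t, c) for t, c in zip(ts, cs)]
deltas = [int(t <= c) for t, c in zip(ts, cs)]
resample = bootstrap_resample(xs, zs, deltas, r=0.2, rng=rng)
```

Each call rebuilds the weights from scratch, which keeps the sketch short but costs O(n^2) per resample; a production implementation would presumably cache or vectorize these computations.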
In this paper, we want to estimate the probability of default function, $PD(t|x)$, for a fixed $x \in I$ and $t$ covering the interval $I_T \subseteq \mathbb{R}$. Therefore, our goal is to get the bandwidth $h_{MISE} \in I_1$ that minimizes the mean integrated squared error given by
$$MISE_x(h) = E\left[\int_{I_T} \left(\widehat{PD}_h(t|x) - PD(t|x)\right)^2 dt\right]$$
whose bootstrap approximation is
$$MISE_x^*(h) = E\left[\int_{I_T} \left(\widehat{PD}_h^*(t|x) - \widehat{PD}_r(t|x)\right)^2 dt\right]$$
where $\widehat{PD}_r(t|x)$ is the estimation of the theoretical PD with pilot bandwidth $r$, using the sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$, and $\widehat{PD}_h^*(t|x)$ is the bootstrap estimation of $PD$ with bandwidth $h$, using the bootstrap resample $\{(X_i^*, Z_i^*, \delta_i^*)\}_{i=1}^n$.
The resampling distribution of $\widehat{PD}_h^*(t|x)$ cannot be computed in closed form, so the Monte Carlo method is used. It is based on obtaining $B$ bootstrap resamples and estimating $\widehat{PD}_h^*(t|x)$ for each of them. Thus, the distribution of $\widehat{PD}_h^*(t|x)$ is approximated by the empirical one of $\widehat{PD}_h^{*,1}(t|x), \dots, \widehat{PD}_h^{*,B}(t|x)$, obtained from $B$ bootstrap resamples, and the bootstrap version of the estimation error committed by Beran’s estimator for any smoothing parameter $h$ is given by
$$MISE_x^*(h) \approx \frac{1}{B} \sum_{k=1}^B \int_{I_T} \left(\widehat{PD}_h^{*,k}(t|x) - \widehat{PD}_r(t|x)\right)^2 dt. \tag{5}$$
Likewise, the integral is approximated by a Riemann sum.
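The Monte Carlo average combined with the Riemann sum can be sketched as below. The left-endpoint rule and the function names are assumptions made for illustration; the text does not say which Riemann sum is used.

```python
def riemann_ise(curve, reference, grid):
    """Left Riemann sum of (curve - reference)^2 over the time grid."""
    total = 0.0
    for j in range(len(grid) - 1):
        total += (curve[j] - reference[j]) ** 2 * (grid[j + 1] - grid[j])
    return total

def bootstrap_mise(boot_curves, pilot_curve, grid):
    """Average integrated squared error of B bootstrap PD curves around the pilot curve."""
    return sum(riemann_ise(c, pilot_curve, grid) for c in boot_curves) / len(boot_curves)

# Hypothetical curves evaluated on a three-point grid.
grid = [0.0, 0.5, 1.0]
pilot = [0.2, 0.2, 0.2]
boot = [[0.3, 0.3, 0.3], [0.1, 0.2, 0.2]]
value = bootstrap_mise(boot, pilot, grid)
```

Here the two integrated squared errors are 0.01 and 0.005, so the approximation equals 0.0075.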

2.1.2. Algorithm for Bootstrap Bandwidth Selector Based on Beran’s Estimator

Let $x \in I$ be a fixed value of the covariate, $t \in I_T$ and $r \in I_1$:
  1. Compute $\widehat{PD}_r(t|x)$ from the original sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$.
  2. Obtain $B$ bootstrap resamples of the form $\{(X_i^{*,k}, Z_i^{*,k}, \delta_i^{*,k})\}_{i=1}^n$ with $k = 1, \dots, B$ using the smoothed bootstrap with pilot bandwidth $r \in I_1$ and calculate $\widehat{PD}_h^{*,k}(t|x)$ for each of them.
  3. Approximate $MISE_x^*(h)$ according to (5).
  4. Repeat Steps 1–3 for values of $h$ in a grid of $I_1$.
  5. Select the value of $h$ that provides the smallest $MISE_x^*(h)$ as the bootstrap bandwidth $h^*$.
Concerning the auxiliary bandwidth $r \in I_1$, a preliminary analysis not shown here suggests
$$r = \frac{3}{4}\left(Q_X(0.975) - Q_X(0.025)\right)\left(\sum_{i=1}^n \delta_i\right)^{-1/3} \tag{6}$$
where $Q_X(u)$ is the $u$ quantile of the sample $\{X_i\}_{i=1}^n$, as a suitable pilot bandwidth in this context.
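The pilot bandwidth can be computed in a few lines. The sketch below reads the formula as three quarters of the interquantile range $Q_X(0.975) - Q_X(0.025)$ times the number of uncensored observations raised to the power $-1/3$, and it assumes a linear-interpolation sample quantile, since the text does not state which quantile convention is used. The names `sample_quantile` and `pilot_r` are hypothetical.

```python
def sample_quantile(values, u):
    """Sample quantile by linear interpolation between order statistics (assumed convention)."""
    s = sorted(values)
    pos = u * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def pilot_r(xs, deltas):
    """Pilot bandwidth: 3/4 * (Q_X(0.975) - Q_X(0.025)) * (number of uncensored)^(-1/3)."""
    spread = sample_quantile(xs, 0.975) - sample_quantile(xs, 0.025)
    return 0.75 * spread * sum(deltas) ** (-1.0 / 3.0)

# Hypothetical sample: 41 equally spaced scores, 27 of them uncensored.
xs = [i / 40.0 for i in range(41)]
deltas = [1] * 27 + [0] * 14
r = pilot_r(xs, deltas)
```

For this toy sample the interquantile range is 0.95 and the cube root of 27 is 3, so the pilot bandwidth is about 0.2375.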
Note that the proposed algorithm is also valid to obtain a bootstrap approximation of the optimal bandwidth for the estimation of $PD(t|x)$ for fixed values of $t \in I_T$ and $x \in I$ by replacing $MISE_x^*(h)$ by $MSE_{t,x}^*(h)$, which is the bootstrap analogue of
$$MSE_{t,x}(h) = E\left[\left(\widehat{PD}_h(t|x) - PD(t|x)\right)^2\right].$$

2.2. Smoothed Beran’s Estimator

Nonparametric estimators for the probability of default such as Beran’s estimator, $\widehat{PD}_h(t|x)$, are smoothed in the covariate $X$. It is interesting to consider estimators with double smoothing both in the covariate, $X$, and in the time variable, $T$. This idea was previously used in [23,24,25] to obtain doubly smoothed estimators of the conditional survival function. In [14], a doubly smoothed version of Beran’s PD estimator was proposed based on a further smoothing, in time, of Beran’s estimator of the conditional survival function ([26]). Asymptotic properties and simulation studies carried out in [14] show that the doubly smoothed Beran’s estimator performs better than the classical Beran’s estimator when estimating the probability of default curve. This section presents a method for automatic selection of the bandwidths on which it depends.
Let $\hat{S}_h(t|x)$ be Beran’s estimator of the conditional survival function given in (2) with $h = h_n$ being the smoothing parameter for the covariate. Then, the expression for the doubly smoothed Beran’s estimator of the conditional survival function defined in [25] is as follows:
$$\tilde{S}_{h,g}(t|x) = 1 - \sum_{i=1}^n s_{(i)}\, \mathbb{K}\left(\frac{t - Z_{(i)}}{g}\right) \tag{7}$$
with $s_{(i)} = \hat{S}_h(Z_{(i-1)}|x) - \hat{S}_h(Z_{(i)}|x)$, where $Z_{(i)}$ is the $i$-th element of the sorted sample of $Z$, $\mathbb{K}(t)$ is the distribution function of a kernel $K$, $\mathbb{K}(t) = \int_{-\infty}^{t} K(u)\, du$, and $g = g_n$ is the smoothing parameter for the time variable. Then, plugging (7) into (1), the probability of default estimator based on the smoothed Beran’s survival estimator is obtained:
$$\widetilde{PD}_{h,g}(t|x) = 1 - \frac{\tilde{S}_{h,g}(t+b|x)}{\tilde{S}_{h,g}(t|x)}. \tag{8}$$
A bootstrap method is proposed for the automatic selection of the bivariate bandwidth ( h , g ) .
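To make the time smoothing concrete, the sketch below evaluates the doubly smoothed survival function and the corresponding PD from precomputed values of Beran’s estimator at the sorted lifetimes. It assumes a Gaussian kernel (so the kernel distribution function is the standard normal cdf) and the convention that Beran’s estimator equals one before the first lifetime; the function names and toy numbers are hypothetical.

```python
import math

def gaussian_cdf(u):
    """Distribution function of the Gaussian kernel."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def smoothed_survival(t, sorted_z, beran_at_z, g):
    """1 - sum_i s_(i) * K((t - Z_(i)) / g), with s_(i) the jumps of Beran's estimator."""
    total, previous = 0.0, 1.0         # assumes S_h(Z_(0)|x) = 1
    for z, s in zip(sorted_z, beran_at_z):
        total += (previous - s) * gaussian_cdf((t - z) / g)
        previous = s
    return 1.0 - total

def smoothed_pd(t, b, sorted_z, beran_at_z, g):
    """Plug-in PD based on the doubly smoothed survival estimator."""
    s_t = smoothed_survival(t, sorted_z, beran_at_z, g)
    if s_t <= 0.0:
        raise ValueError("smoothed survival at t is zero")
    return 1.0 - smoothed_survival(t + b, sorted_z, beran_at_z, g) / s_t

# Hypothetical inputs: three sorted lifetimes and Beran's estimator at each.
sorted_z = [1.0, 2.0, 3.0]
beran_at_z = [0.8, 0.5, 0.2]
s_small_g = smoothed_survival(2.0, sorted_z, beran_at_z, g=0.01)
pd_smooth = smoothed_pd(1.0, 1.0, sorted_z, beran_at_z, g=0.5)
```

With a very small $g$ the jump at $t = 2$ is split in half by the symmetric kernel, giving 1 - (0.2 + 0.15) = 0.65, midway between the left and right limits of the step function.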

2.2.1. Algorithm for Bootstrap Resampling

Let $I_1 \subseteq \mathbb{R}$ and $I_2 \subseteq \mathbb{R}$ be intervals containing appropriate bandwidth values and let $r \in I_1$ and $s \in I_2$ be pilot bandwidths for the smoothed resample of $X$, $T$ and $C$:
  • Obtain $U_1, \dots, U_n$ iid with $U_i \sim U(0,1)$, and $V_1, \dots, V_n$, $W_1^1, \dots, W_n^1$ and $W_1^2, \dots, W_n^2$, each iid with common density $K$, for $i = 1, \dots, n$.
  • For each $i = 1, \dots, n$, obtain
    $$X_i^* = X_{[nU_i]+1} + r V_i,$$
    $$T_i^* = T_{0,i}^* + s W_i^1,$$
    $$C_i^* = C_{0,i}^* + s W_i^2,$$
    where $T_{0,i}^*$ is resampled from $\hat{F}_r(t|X_i^*)$ constructed using Beran’s estimator with the sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$ and $C_{0,i}^*$ is resampled from $\hat{G}_r(t|X_i^*)$ constructed using Beran’s estimator with the sample $\{(X_i, Z_i, 1 - \delta_i)\}_{i=1}^n$.
  • For each $i = 1, \dots, n$, obtain
    $$Z_i^* = \min\{T_i^*, C_i^*\},$$
    $$\delta_i^* = I\{T_i^* \le C_i^*\}.$$
  • Consider the bootstrap resample $\{(X_i^*, Z_i^*, \delta_i^*)\}_{i=1}^n$.
The conditional distribution functions of $T^* | X^*$ and $C^* | X^*$ are, respectively, the smoothed Beran’s estimators $\tilde{F}_{r,s}(t|X_i^*)$ and $\tilde{G}_{r,s}(t|X_i^*)$.
The optimal bivariate bandwidth, $(h_{MISE}, g_{MISE}) \in I_1 \times I_2$, is defined as the pair of bandwidths that minimizes the mean integrated squared error given by
$$MISE_x(h, g) = E\left[\int_{I_T} \left(\widetilde{PD}_{h,g}(t|x) - PD(t|x)\right)^2 dt\right].$$
The bootstrap version of $MISE_x(h, g)$ is given by
$$MISE_x^*(h, g) = E\left[\int_{I_T} \left(\widetilde{PD}_{h,g}^*(t|x) - \widetilde{PD}_{r,s}(t|x)\right)^2 dt\right],$$
where $\widetilde{PD}_{r,s}(t|x)$ is the smoothed Beran’s PD estimation with pilot bandwidths $(r, s) \in I_1 \times I_2$ using the sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$ and $\widetilde{PD}_{h,g}^*(t|x)$ is the bootstrap estimation of $PD$ with bandwidths $(h, g)$, using the bootstrap resample $\{(X_i^*, Z_i^*, \delta_i^*)\}_{i=1}^n$. Since the sampling distribution of $\widetilde{PD}_{h,g}^*(t|x)$ is unknown, the Monte Carlo method gives the following approximation
$$MISE_x^*(h, g) \approx \frac{1}{B} \sum_{k=1}^B \int_{I_T} \left(\widetilde{PD}_{h,g}^{*,k}(t|x) - \widetilde{PD}_{r,s}(t|x)\right)^2 dt, \tag{10}$$
based on the empirical distribution of $\widetilde{PD}_{h,g}^*(t|x)$ obtained from $B$ bootstrap resamples. The integral is approximated by a Riemann sum.

2.2.2. Algorithm for Bootstrap Bandwidth Selector Based on the Smoothed Beran’s Estimator

Let $x$ be a fixed value of the covariate, $t \in I_T$ and $(r, s) \in I_1 \times I_2$:
  1. Compute $\widetilde{PD}_{r,s}(t|x)$ from the original sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$.
  2. Obtain $B$ bootstrap resamples of the form $\{(X_i^{*,k}, Z_i^{*,k}, \delta_i^{*,k})\}_{i=1}^n$ with $k = 1, \dots, B$ using the doubly smoothed bootstrap and calculate $\widetilde{PD}_{h,g}^{*,k}(t|x)$ for each of them.
  3. Approximate $MISE_x^*(h, g)$ according to (10).
  4. Repeat Steps 1–3 for pairs of values $(h, g)$ in a grid of $I_1 \times I_2$.
  5. Obtain the pair $(h, g)$ that provides the smallest $MISE_x^*(h, g)$ as the bootstrap bandwidth $(h^*, g^*)$.
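The final two steps amount to a search over a bivariate grid. The sketch below shows that search with a hypothetical stand-in objective, `toy_mise`, in place of the Monte Carlo approximation of the bootstrap MISE; in practice the objective would be built from the bootstrap resamples, and the grids would cover $I_1 \times I_2$.

```python
def select_bandwidths(mise_star, h_grid, g_grid):
    """Return the grid pair (h, g) with the smallest value of mise_star(h, g)."""
    best = None
    for h in h_grid:
        for g in g_grid:
            value = mise_star(h, g)
            if best is None or value < best[0]:
                best = (value, h, g)
    return best[1], best[2]

# Hypothetical stand-in for the bootstrap MISE surface, minimized at (0.3, 0.1).
toy_mise = lambda h, g: (h - 0.3) ** 2 + (g - 0.1) ** 2
h_grid = [0.1 * k for k in range(1, 6)]
g_grid = [0.05 * k for k in range(1, 6)]
h_star, g_star = select_bandwidths(toy_mise, h_grid, g_grid)
```

An exhaustive grid costs one objective evaluation per pair, which is why a bounded quasi-Newton optimizer can be preferable when each evaluation involves hundreds of resamples.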
The auxiliary bandwidth $r \in I_1$ was defined in (6). The pilot bandwidth $s \in I_2$ for the time variable smoothing is
$$s = \frac{3}{4}\left(Q_Z(0.975) - Q_Z(0.025)\right)\left(\sum_{i=1}^n \delta_i\right)^{-1/7}$$
where $Q_Z(u)$ is the $u$ quantile of the sample $\{Z_i\}_{i=1}^n$.

3. Simulation Study for Bandwidth Selection

A simulation study was conducted in order to show the behavior of bootstrap bandwidth selectors for Beran’s and smoothed Beran’s estimators proposed in Section 2. Two models are considered, one with Weibull lifetime and censoring time distributions and another one with exponential distributions.
Model 1 considers a $U(0,1)$ distribution for $X$. The time to occurrence of the event conditional to the covariate, $T | X = x$, follows a Weibull distribution with parameters $d = 2$ and $\Gamma(x)^{-1/d}$ where $\Gamma(x) = 1 + 5x$, and the censoring time conditional to the covariate, $C | X = x$, follows a Weibull distribution with parameters $d = 2$ and $\Delta(x)^{-1/d}$ where $\Delta(x) = 10 + d_1 x + 20 x^2$. In this case, the conditional survival function and the censoring conditional probability are given by
$$S(t|x) = e^{-\Gamma(x) t^d},$$
$$P(\delta = 0 | X = x) = \frac{\Delta(x)}{\Gamma(x) + \Delta(x)}.$$
Having set the value of the covariate $x = 0.6$, the value of $d_1$ is chosen so that the censoring conditional probability is 0.2 or 0.5. These values are $d_1 = -27$ and $d_1 = -22$, respectively. The conditional survival function for this model is estimated in a time grid of size $n_T$, $0 < t_1 < \dots < t_{n_T}$, where $t_{n_T} + b = F^{-1}(0.95|x) = 0.8654$ and $b = 0.15$, i.e., about 20% of the time grid range for the value of the covariate $x = 0.6$. Therefore, in this case, $I_T = (0, 0.8654)$.
Model 2 also considers a $U(0,1)$ distribution for $X$. The time to occurrence of the event conditional to the covariate, $T | X = x$, follows an exponential distribution with parameter $E(x) = 2 + 58x - 160x^2 + 107x^3$, and the censoring time conditional to the covariate, $C | X = x$, follows an exponential distribution with parameter $\Theta(x) = 10 + c_1 x + 20 x^2$. In this scenario, the conditional survival function and the censoring conditional probability are the following:
$$S(t|x) = e^{-E(x) t},$$
$$P(\delta = 0 | X = x) = \frac{\Theta(x)}{E(x) + \Theta(x)}.$$
Having set the value of the covariate, $x = 0.8$, the value of $c_1$ is chosen so that the censoring conditional probability is approximately 0.2 or 0.5. These values are $c_1 = -113/4$ and $c_1 = -55/2$, respectively. The conditional survival function is estimated in a time grid of size $n_T$, $0 < t_1 < \dots < t_{n_T}$, where $t_{n_T} + b = F^{-1}(0.95|x) = 3.8211$ and $b = 0.7$, i.e., about 20% of the time grid range for the value of the covariate $x = 0.8$. Therefore, in this case, $I_T = (0, 3.8211)$.
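The constants quoted for both models can be checked with a few lines of arithmetic. The sketch below takes $d_1 = -27, -22$ and $c_1 = -113/4, -55/2$ (negative values, which is what the stated censoring probabilities require) and recomputes the censoring probabilities and the 0.95 quantiles; the function names are illustrative only.

```python
import math

def gamma_rate(x):
    """Model 1 event parameter: Gamma(x) = 1 + 5x."""
    return 1.0 + 5.0 * x

def e_rate(x):
    """Model 2 event parameter: E(x) = 2 + 58x - 160x^2 + 107x^3."""
    return 2.0 + 58.0 * x - 160.0 * x ** 2 + 107.0 * x ** 3

def censor_rate(x, coef):
    """Censoring parameter shared by both models: 10 + coef * x + 20x^2."""
    return 10.0 + coef * x + 20.0 * x ** 2

def censoring_prob(event, censor):
    """P(delta = 0 | X = x) = censor / (event + censor)."""
    return censor / (event + censor)

# Model 1 at x = 0.6: S(t|x) = exp(-Gamma(x) t^2)
x1 = 0.6
p1_low = censoring_prob(gamma_rate(x1), censor_rate(x1, -27.0))
p1_high = censoring_prob(gamma_rate(x1), censor_rate(x1, -22.0))
q1 = math.sqrt(math.log(20.0) / gamma_rate(x1))   # solves S(t|x) = 0.05

# Model 2 at x = 0.8: S(t|x) = exp(-E(x) t)
x2 = 0.8
p2_low = censoring_prob(e_rate(x2), censor_rate(x2, -113.0 / 4.0))
p2_high = censoring_prob(e_rate(x2), censor_rate(x2, -55.0 / 2.0))
q2 = math.log(20.0) / e_rate(x2)                  # solves S(t|x) = 0.05
```

Model 1 reproduces the censoring probabilities 0.2 and 0.5 exactly, while the rounded Model 2 coefficients give values close to, but not exactly, 0.2 and 0.5; both quantiles agree with the reported 0.8654 and 3.8211 to four decimals.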
It can be proved that Model 1 is close to a proportional hazards model, while Model 2 moves away from this parametric model. These two models were used in the simulation study carried out by [14].
The boundary effect is corrected using the reflection principle, and the truncated Gaussian kernel with truncation range $(-50, 50)$ is considered. The size of the lifetime grid is $n_T = 100$. The sample size is $n = 400$. The simulation study is carried out with software developed in R by the authors. In order to minimize the MISE function without increasing CPU time more than necessary, L-BFGS-B, a limited-memory algorithm for solving large nonlinear optimization problems, is used. It was proposed by [27] for solving optimization problems subject to simple bounds on the variables in which information on the Hessian matrix is difficult to obtain. Results of numerical studies about this method are shown in [27]. It is available in the stats package from the Comprehensive R Archive Network (CRAN) using Fortran 77 subroutines (see [28]).

3.1. Simulation Study for Beran’s Estimator

In this subsection, the behavior of the bootstrap bandwidth selector for Beran’s estimator is shown. For each model, the estimation error function $MISE_x(h)$ is approximated via Monte Carlo using 300 simulated samples. The bandwidth that minimises $MISE_x(h)$ is obtained and denoted by $h_{MISE}$. The values of $h_{MISE}$ and $MISE_x(h_{MISE})$ are used as a benchmark.
In the simulation study, $N = 300$ simulated samples are used. For each sample, $B = 500$ bootstrap resamples are obtained to approximate the bootstrap MISE function, $MISE_x^*(h)$, and obtain the bootstrap bandwidth associated to each simulated sample, $h_j^*$, $j = 1, 2, \dots, N$. The mean value of the $N$ bootstrap bandwidths and the standard deviation are defined as follows
$$\overline{h^*} = \frac{1}{N} \sum_{j=1}^N h_j^*, \qquad sd_{h^*} = \sqrt{\frac{1}{N} \sum_{j=1}^N \left(h_j^* - \overline{h^*}\right)^2}.$$
As a relative measure of the difference between the bootstrap bandwidth and the optimal one, we compute
$$H_j^* = \frac{h_j^* - h_{MISE}}{h_{MISE}}$$
with $j = 1, \dots, N$. The mean of the absolute value of these relative deviations, $\overline{H^*} = \frac{1}{N} \sum_{j=1}^N |H_j^*|$, is a good measure of how close the bootstrap bandwidth is to the optimal one.
For each sample, the estimation error committed by Beran’s estimator with the corresponding bootstrap bandwidth,
$$MISE_x(h_j^*) = E\left[\int_{I_T} \left(\widehat{PD}_{h_j^*}(t|x) - PD(t|x)\right)^2 dt\right],$$
and its square root, $RMISE_x(h_j^*)$, are approximated via Monte Carlo using 300 simulated samples. The mean of these estimation errors given by
$$\overline{RMISE_x(h^*)} = \frac{1}{N} \sum_{j=1}^N RMISE_x(h_j^*)$$
is used as a measure of the estimation error made by the bootstrap bandwidth, when compared with the estimation error made by the MISE bandwidth.
As a relative measure of the difference between the estimation errors using the bootstrap and the MISE bandwidths, the following ratios are defined:
$$R_j^* = \frac{RMISE_x(h_j^*) - RMISE_x(h_{MISE})}{RMISE_x(h_{MISE})}$$
satisfying $R_j^* \ge 0$ for all $j = 1, \dots, N$. The mean of the $R_j^*$ values with $j = 1, \dots, N$ is denoted by $\overline{R^*} = \frac{1}{N} \sum_{j=1}^N R_j^*$. Small values (close to zero) of $\overline{H^*}$ and $\overline{R^*}$ indicate good behavior of the bootstrap bandwidth. Values of the bootstrap bandwidths, estimation errors and relative measures for Models 1 and 2 are included in Table 1. The results show a good performance of the proposed bootstrap selector.
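The two relative measures are simple to compute once the bootstrap bandwidths and their errors are available. The sketch below uses made-up numbers purely to show the arithmetic; none of the values come from Table 1, and the function names are hypothetical.

```python
def relative_deviations(values, benchmark):
    """(value - benchmark) / benchmark for each value."""
    return [(v - benchmark) / benchmark for v in values]

def mean(values):
    return sum(values) / len(values)

# Hypothetical bootstrap bandwidths and benchmark MISE bandwidth.
h_boot = [0.18, 0.22, 0.20]
h_mise = 0.20
H = relative_deviations(h_boot, h_mise)
H_bar = mean([abs(v) for v in H])        # mean absolute relative deviation

# Hypothetical RMISE values for the bootstrap and MISE bandwidths.
rmise_boot = [0.055, 0.050, 0.060]
rmise_opt = 0.050
R = relative_deviations(rmise_boot, rmise_opt)
R_bar = mean(R)                          # mean relative increase in error
```

Here the bandwidth deviations are -10%, +10% and 0, so their mean absolute value is about 0.067, and the mean relative increase in error is 0.1.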
Figure 1 and Figure 2 show the function $MISE_x(h)$ along with the Monte Carlo approximations of $MISE_x^*(h)$ for some simulated samples and the boxplots of $H_j^*$ and $R_j^*$ with $j = 1, \dots, N$ for Models 1 and 2. The bootstrap bandwidth $h^*$ tends to be slightly smaller than $h_{MISE}$ in Model 1 and larger than it in Model 2, which is reflected in the boxplots of $H_j^*$. Nevertheless, these figures show that the $MISE_x(h)$ curve is fairly flat and variations in the selection of $h$ do not imply an important increase in the estimation error.
In order to illustrate the results, Figure 3 shows the theoretical probability of default function and Beran’s estimation with the MISE and bootstrap bandwidths drawn for one sample from Models 1 and 2 when the conditional probability of censoring is 0.5. For large values of time, the performance of the estimator becomes worse, because in that region there are few data, most of them censored, which therefore offer poor information.

3.2. Simulation Study for the Smoothed Beran’s Estimator

In this section, a simulation study on the bootstrap bandwidth selector of the smoothed Beran’s estimator in (8) is carried out. The resampling technique and Monte Carlo approximation of the MISE presented in Section 2.2 are used.
For each model, the error function $MISE_x(h, g)$ is approximated via Monte Carlo from 300 simulated samples, and the bivariate bandwidth that minimizes $MISE_x(h, g)$ is obtained and denoted by $(h_{MISE}, g_{MISE})$. The values of $(h_{MISE}, g_{MISE})$ and $MISE_x(h_{MISE}, g_{MISE})$ are used as a benchmark.
In the study, $N = 300$ samples are simulated. For each simulated sample, the corresponding bootstrap bandwidths are approximated from $B = 500$ resamples, obtaining $(h_j^*, g_j^*)$ with $j = 1, \dots, N$. The mean value of the $N$ bootstrap bandwidths and the standard deviation are the following:
$$(\overline{h^*}, \overline{g^*}) = \left(\frac{1}{N} \sum_{j=1}^N h_j^*, \ \frac{1}{N} \sum_{j=1}^N g_j^*\right),$$
$$sd_{h^*} = \sqrt{\frac{1}{N} \sum_{j=1}^N \left(h_j^* - \overline{h^*}\right)^2}, \qquad sd_{g^*} = \sqrt{\frac{1}{N} \sum_{j=1}^N \left(g_j^* - \overline{g^*}\right)^2}.$$
In order to measure the distance of the bootstrap bidimensional bandwidth of the $j$-th sample, $(h_j^*, g_j^*)$, from the corresponding MISE bandwidth, $(h_{MISE}, g_{MISE})$, consider the vector
$$D_j^* = \left(\frac{h_j^* - h_{MISE}}{h_{MISE}}, \ \frac{g_j^* - g_{MISE}}{g_{MISE}}\right) \in \mathbb{R}^2$$
and its Euclidean norm, denoted by $H_j^* = \|D_j^*\|_2$ with $j = 1, \dots, N$. The mean value, $\overline{H^*} = \frac{1}{N} \sum_{j=1}^N H_j^*$, is a measure of how close the bootstrap bandwidths are to the MISE one.
For each sample, the estimation error committed by the smoothed Beran’s estimator with the corresponding bootstrap bandwidth,
$$MISE_x(h_j^*, g_j^*) = E\left[\int_{I_T} \left(\widetilde{PD}_{h_j^*, g_j^*}(t|x) - PD(t|x)\right)^2 dt\right],$$
and its square root, $RMISE_x(h_j^*, g_j^*)$, are approximated via Monte Carlo using 300 simulated samples. The mean of these estimation errors given by
$$\overline{RMISE_x(h^*, g^*)} = \frac{1}{N} \sum_{j=1}^N RMISE_x(h_j^*, g_j^*)$$
is used as a measure of the estimation error committed by the bootstrap bidimensional bandwidth in the model.
The ratio
$$R_j^* = \frac{RMISE_x(h_j^*, g_j^*) - RMISE_x(h_{MISE}, g_{MISE})}{RMISE_x(h_{MISE}, g_{MISE})}$$
is defined as a relative measure of the difference between the error committed by the estimator with bootstrap bandwidth and MISE bandwidth. The mean of the positive values $R_j^*$ with $j = 1, \dots, N$ is denoted by $\overline{R^*} = \frac{1}{N} \sum_{j=1}^N R_j^*$. Values of the bootstrap bivariate bandwidths, estimation errors and relative measures for Models 1 and 2 are included in Table 2.
Figure 4 shows the $MISE_x(h, g)$ function of the smoothed Beran’s estimator and its bootstrap approximation for one sample of both Models 1 and 2 when the conditional probability of censoring is 0.5. It is approximated on a grid of 50 values of $h$ and 50 values of $g$. Note that both $MISE_x(h, g)$ and $MISE_x^*(h, g)$ curves for each fixed $h$ value are quite similar in the region close to the minimum value of $MISE_x^*(h, g)$. Thus, the influence of the covariate smoothing parameter $h$ is weak when estimating the PD using values of bandwidth $g$ close to the optimal one.
Figure 5 and Figure 6 show the boxplots of $H_j^*$ and $R_j^*$ with $j = 1, \dots, N$. In general, the selector tends to underestimate the value of the bandwidths. Due to the behavior of the $MISE_x(h, g)$ curves mentioned above, this does not lead to a significant increase in the estimation error.
Figure 7 shows the theoretical probability of default function and the smoothed Beran’s estimation with MISE and bootstrap bandwidths for one sample from Models 1 and 2 when the conditional probability of censoring is 0.5. Comparing this figure with the equivalent one for Beran’s estimator shown in Figure 3, the improvement in estimation due to the double smoothing is remarkable.
The results shown in Table 1 and Table 2 are summarized in Table 3 to compare the behavior of Beran’s (BERAN) and the smoothed Beran’s (SBERAN) estimators for the PD and to evaluate whether the improvement that smoothing in the time variable provides for PD estimation is preserved when approximating the smoothing parameters by resampling techniques. Table 3 shows the estimation errors committed by Beran’s and the smoothed Beran’s estimators of the probability of default using bootstrap bandwidths. In order to measure the increase in estimation error resulting from using Beran’s estimator, the following ratio is defined:
$$RS = \frac{\overline{RMISE_x(h^*)} - \overline{RMISE_x(h^*, g^*)}}{\overline{RMISE_x(h^*, g^*)}}$$
and included in Table 3.
In Model 1, the estimation error committed by Beran’s estimator is 20% larger than the error committed by the smoothed Beran’s estimator when the conditional probability of censoring is 0.2, and 50% larger when it is 0.5. In Model 2, these differences are even larger: the estimation error increases by up to 80% when using Beran’s estimator with bootstrap bandwidth instead of the doubly smoothed Beran’s estimator.

4. Confidence Regions Using Beran and Smoothed Beran’s Estimators

Let $x \in I$ be a fixed value of the covariate and consider $PD(t|x)$, the probability of default curve with $t \in I_T$. The curve $PD(t|x)$ belongs to the function space $\mathcal{F}(I_T)$ whose elements are real-valued functions with domain $I_T$. From the sample $\{(X_i, Z_i, \delta_i), i = 1, \dots, n\}$, Beran’s estimation of $PD(t|x)$, $\widehat{PD}_h(t|x)$, is obtained and a confidence region of $PD(t|x)$ at the $1 - \alpha$ confidence level associated to Beran’s estimator can be constructed. A similar construction is done for the smoothed Beran’s estimator. This confidence region of $PD(t|x)$ is a random subset of $\mathcal{F}(I_T)$ denoted by $R_\alpha$ that satisfies
$$P\left(PD(t|x) \in R_\alpha, \ \forall t \in I_T\right) = 1 - \alpha.$$
In this section, a method for constructing confidence regions, $R_\alpha$, based on Beran’s and the smoothed Beran’s estimators is developed.
First, Beran’s estimator of the probability of default, $\widehat{PD}_h(t|x)$, given in (3) is used. This method follows the ideas of [29] to obtain prediction regions. It is based on finding the value of $\lambda_\alpha \in \mathbb{R}^+$ such that
$$P\left(\left|\widehat{PD}_h(t|x) - PD(t|x)\right| < \lambda_\alpha \sigma(t), \ \forall t \in I_T\right) = 1 - \alpha$$
with $\sigma^2(t) = Var\left(\widehat{PD}_h(t|x)\right)$. Thus, the theoretical confidence region is defined by
$$R_\alpha = \left\{\left(\widehat{PD}_h(t|x) - \lambda_\alpha \sigma(t), \ \widehat{PD}_h(t|x) + \lambda_\alpha \sigma(t)\right) : t \in I_T\right\}.$$
Since $\lambda_\alpha$ and $\sigma(t)$ are unknown, they are approximated by means of a bootstrap technique. The bootstrap confidence region is defined as follows:
$$R_\alpha^* = \left\{\left(\widehat{PD}_h^*(t|x) - \lambda_\alpha^* \sigma^*(t), \ \widehat{PD}_h^*(t|x) + \lambda_\alpha^* \sigma^*(t)\right) : t \in I_T\right\},$$
where $\widehat{PD}_h^*(t|x)$ is the bootstrap estimation of $PD$ with bandwidth $h$ and $\lambda_\alpha^*$ and $\sigma^*(t)$ are the bootstrap analogues of $\lambda_\alpha$ and $\sigma(t)$. The confidence region $R_\alpha^*$ satisfies
$$p(\lambda_\alpha^*) = P\left(\widehat{PD}_r(t|x) \in R_\alpha^*, \ \forall t \in I_T\right) = 1 - \alpha. \tag{12}$$
From the original sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$, Beran’s estimator of $PD(t|x)$ is obtained with appropriate bandwidth $h$, $\widehat{PD}_h(t|x)$. The algorithm to obtain the bootstrap confidence region for $PD(t|x)$ at confidence level $1 - \alpha$ associated to $\widehat{PD}_h(t|x)$ is explained below. The Monte Carlo method is used to approximate $\sigma^*(t)$, and an iterative method is used to approximate the value of $\lambda_\alpha^*$ so that the confidence region has a confidence level approximately equal to $1 - \alpha$.

4.1. Confidence Region Based on Beran’s Estimator

  • Compute Beran’s estimator $\widehat{PD}_r(t|x)$ from the original sample $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$ and pilot bandwidth $r \in I_1$.
  • Generate $B$ bootstrap resamples of the form $\{(X_i^{*,k}, Z_i^{*,k}, \delta_i^{*,k})\}_{i=1}^n$ by means of the resampling algorithm presented in Section 2.1 and pilot bandwidth $r$.
  • For $k = 1, \ldots, B$, compute $\widehat{PD}_h^{*,k}(t|x)$ with the $k$-th bootstrap resample and bandwidth $h$, obtaining $\{\widehat{PD}_h^{*,k}(t|x)\}_{k=1}^B$.
  • Approximate the standard deviation of $\widehat{PD}_h^*(t|x)$ by
    $$\sigma^*(t) \approx \left\{\frac{1}{B}\sum_{k=1}^B \left[\widehat{PD}_h^{*,k}(t|x) - \frac{1}{B}\sum_{l=1}^B \widehat{PD}_h^{*,l}(t|x)\right]^2\right\}^{1/2}, \quad t \in I_T.$$
  • Use an iterative method to obtain an approximation of the value $\lambda_\alpha^*$ defined in (12).
  • The confidence region is given by
    $$R_\alpha = \left\{\left(\widehat{PD}_h(t|x) - \lambda_\alpha^* \sigma^*(t),\ \widehat{PD}_h(t|x) + \lambda_\alpha^* \sigma^*(t)\right) : t \in I_T\right\}.$$
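The last steps of this algorithm can be sketched in a few lines of Python. The sketch assumes that the B bootstrap estimates have already been evaluated on a common time grid and stored as plain lists; the function names `bootstrap_sd` and `confidence_band` and the toy numbers are illustrative and not part of the original method description.

```python
import math


def bootstrap_sd(boot_curves):
    """Pointwise standard deviation (divisor B) of B bootstrap curves.

    boot_curves is a list of B lists, each holding one bootstrap PD
    estimate evaluated on the same time grid.
    """
    n_boot = len(boot_curves)
    n_times = len(boot_curves[0])
    sd = []
    for j in range(n_times):
        values = [curve[j] for curve in boot_curves]
        mean = sum(values) / n_boot
        sd.append(math.sqrt(sum((v - mean) ** 2 for v in values) / n_boot))
    return sd


def confidence_band(pd_hat, sd, lam):
    """Lower and upper bounds pd_hat -/+ lam * sd on the time grid."""
    lower = [p - lam * s for p, s in zip(pd_hat, sd)]
    upper = [p + lam * s for p, s in zip(pd_hat, sd)]
    return lower, upper


# Toy illustration: B = 3 made-up bootstrap curves on a 2-point time grid.
toy_boot = [[0.10, 0.30], [0.12, 0.26], [0.14, 0.34]]
toy_sd = bootstrap_sd(toy_boot)
toy_lower, toy_upper = confidence_band([0.12, 0.30], toy_sd, lam=2.0)
```

The value of `lam` is a placeholder here; in the algorithm it is the approximation of $\lambda_\alpha^*$ obtained by the iterative method.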

4.2. Iterative Method to Approximate λ α *

The iterative method to approximate the value of $\lambda_\alpha^* \in \mathbb{R}^+$ so that the confidence region $R_\alpha^*$ has a confidence level approximately equal to $1 - \alpha$ is explained below. This algorithm allows the parameter $\lambda_\alpha^*$ to be approximated quickly and efficiently.
Let $\{\widehat{PD}_h^{*,k}(t|x)\}_{k=1}^B$ be Beran’s estimations of the PD with bandwidth $h$ over a set of $B$ bootstrap resamples of $\{(X_i, Z_i, \delta_i)\}_{i=1}^n$. Define the Monte Carlo approximation of $p(\lambda)$ in (12), for any $\lambda \in \mathbb{R}^+$, as follows:
$$p(\lambda) \approx \frac{1}{B}\sum_{k=1}^B I\left(\widehat{PD}_r(t|x) \in \left(\widehat{PD}_h^{*,k}(t|x) - \lambda \sigma^*(t),\ \widehat{PD}_h^{*,k}(t|x) + \lambda \sigma^*(t)\right),\ \forall t \in I_T\right). \tag{13}$$
Let $\lambda_L, \lambda_H \in \mathbb{R}^+$ be such that $p(\lambda_L) \le 1 - \alpha \le p(\lambda_H)$ and let $\zeta > 0$ be a tolerance, for example, $\zeta = 10^{-4}$.
  1. Obtain $\lambda_M = (\lambda_L + \lambda_H)/2$ and compute Monte Carlo approximations of $p(\lambda_L)$, $p(\lambda_M)$ and $p(\lambda_H)$ according to (13).
  2. If $p(\lambda_M) = 1 - \alpha$ or $p(\lambda_H) - p(\lambda_L) < \zeta$, then $\lambda_\alpha^* = \lambda_M$. Otherwise,
     2.1. If $1 - \alpha < p(\lambda_M)$, then set $\lambda_H = \lambda_M$ and return to Step 1.
     2.2. If $p(\lambda_M) < 1 - \alpha$, then set $\lambda_L = \lambda_M$ and return to Step 1.
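The bisection above can be sketched as follows. This is a minimal illustration, assuming the pilot estimate, the bootstrap curves and the bootstrap standard deviations are available as lists on a common time grid. The names `coverage_prob` and `find_lambda`, the default bracket and the `max_iter` guard are assumptions added here; the guard is not part of the method as described and only prevents an endless loop when the stopping rule is never met exactly.

```python
def coverage_prob(lam, pd_pilot, boot_curves, sd):
    """Monte Carlo approximation of p(lambda), in the spirit of (13).

    Counts the bootstrap curves whose open band of half-width lam * sd
    contains the pilot estimate at every grid point.
    """
    inside = 0
    for curve in boot_curves:
        if all(abs(p - c) < lam * s for p, c, s in zip(pd_pilot, curve, sd)):
            inside += 1
    return inside / len(boot_curves)


def find_lambda(pd_pilot, boot_curves, sd, alpha=0.05,
                lam_lo=0.0, lam_hi=10.0, zeta=1e-4, max_iter=200):
    """Bisection search for lambda with p(lambda) close to 1 - alpha."""
    target = 1.0 - alpha
    p_lo = coverage_prob(lam_lo, pd_pilot, boot_curves, sd)
    p_hi = coverage_prob(lam_hi, pd_pilot, boot_curves, sd)
    if not (p_lo <= target <= p_hi):
        raise ValueError("initial bracket does not contain 1 - alpha")
    for _ in range(max_iter):  # guard added for this sketch
        lam_mid = 0.5 * (lam_lo + lam_hi)
        p_mid = coverage_prob(lam_mid, pd_pilot, boot_curves, sd)
        # Step 2: stop when the level is hit or the bracket is flat.
        if p_mid == target or p_hi - p_lo < zeta:
            return lam_mid
        if target < p_mid:   # Step 2.1: shrink from above
            lam_hi, p_hi = lam_mid, p_mid
        else:                # Step 2.2: shrink from below
            lam_lo, p_lo = lam_mid, p_mid
    return 0.5 * (lam_lo + lam_hi)
```

Because the Monte Carlo approximation is a step function of $\lambda$ with jumps of size $1/B$, the exact level $1 - \alpha$ may not be attainable for some choices of $B$ and $\alpha$, which is why the sketch keeps an iteration cap.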
A preliminary analysis not shown here suggests the following pilot bandwidth:
$$r = \frac{3}{4}\left(Q_X(0.975) - Q_X(0.025)\right)\left(\sum_{i=1}^n \delta_i\right)^{-1/3}.$$
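A minimal sketch of this pilot bandwidth is given below. It assumes that $Q_X(p)$ denotes the sample quantile of order $p$ of the covariate and uses the interpolation rule of Python's `statistics.quantiles`, which may differ slightly from the quantile definition used by the authors. The function name `pilot_bandwidth` is illustrative.

```python
import statistics


def pilot_bandwidth(x, delta):
    """Pilot bandwidth r from covariate values x and indicators delta.

    delta[i] is 1 for an uncensored (defaulted) observation and 0 otherwise,
    so sum(delta) is the number of uncensored observations (must be > 0).
    """
    # 39 cut points at probabilities 1/40, ..., 39/40; the first and last
    # are the 0.025 and 0.975 sample quantiles.
    cuts = statistics.quantiles(x, n=40, method="inclusive")
    spread = cuts[-1] - cuts[0]
    n_uncensored = sum(delta)
    return 0.75 * spread * n_uncensored ** (-1.0 / 3.0)
```

The bandwidth shrinks at the rate of the cube root of the number of uncensored observations, so heavier censoring leads to a larger pilot bandwidth for the same covariate spread.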
This method to obtain confidence regions for the curve $PD(t|x)$, for fixed $x \in I$ and $t$ covering $I_T$, based on Beran’s estimator can be adapted to obtain confidence regions using the doubly smoothed Beran’s estimator. Simply replace Beran’s estimator $\widehat{PD}_h(t|x)$ by the smoothed Beran’s estimator $\widetilde{PD}_{h,g}(t|x)$ given in (8) where necessary, and obtain the analogous bootstrap approximations of $\lambda_\alpha$ and $\sigma(t)$. The confidence region is given by
$$R_\alpha = \left\{\left(\widetilde{PD}_{h,g}(t|x) - \lambda_\alpha^* \sigma^*(t),\ \widetilde{PD}_{h,g}(t|x) + \lambda_\alpha^* \sigma^*(t)\right) : t \in I_T\right\}.$$
Denote the lower and upper bounds of the confidence region by $l(t, x)$ and $u(t, x)$, respectively. It may happen that the lower bound of the confidence region is less than 0 or the upper bound is greater than 1 for some points $(t_0, x_0)$. When this happens, we set $l(t_0, x_0) = 0$ or $u(t_0, x_0) = 1$, as appropriate.
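The truncation of the bounds to the unit interval amounts to a pointwise clip. A short sketch, with the illustrative name `clip_to_unit_interval` and bounds stored as lists over the time grid:

```python
def clip_to_unit_interval(lower, upper):
    """Truncate band bounds so that they stay within [0, 1]."""
    clipped_lower = [max(0.0, value) for value in lower]
    clipped_upper = [min(1.0, value) for value in upper]
    return clipped_lower, clipped_upper
```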
The pilot bandwidths defined in (6) and (11) are used for the confidence region algorithm based on both Beran and smoothed Beran’s estimators.

5. Simulation Study for Confidence Regions

A simulation study is carried out to test the performance of the proposed bootstrap confidence regions. Models 1 and 2 described in Section 3 are considered in this study, with identical features. The methods shown in Section 4 are used for this purpose with both Beran’s and the smoothed Beran’s estimators. When Beran’s estimator is used, the bandwidth that minimizes the mean integrated squared error, $h = h_{MISE}$, is used. Similarly, if the smoothed Beran’s estimator is used, the two-dimensional bandwidth that minimizes the mean integrated squared error, $(h, g) = (h_{MISE}, g_{MISE})$, is used. These bandwidths are unknown in practice, but they allow a fair comparison of methods in the simulation study.
The simulation set-up is the one explained in Section 3. Two conditional probabilities of censoring are considered for each model: $P(\delta = 0|x) = 0.2$ and $P(\delta = 0|x) = 0.5$. The number of bootstrap resamples for each sample is $B = 500$, and $N = 300$ simulated samples of each model are obtained. The sample size is $n = 400$. The confidence level is $1 - \alpha$ with $\alpha = 0.05$.
Figure 8 shows Beran’s estimations of the PD for $B = 500$ resamples from one sample of Model 2 when the conditional probability of censoring is 0.5. The theoretical probability of default is also plotted in the figure. The PD is estimated on a time grid $0 = t_1 < t_2 < \cdots < t_{n_T}$ such that $t_{n_T} + b = F^{-1}(0.95|x)$. The information provided by the data in the right tail of such a time distribution is sparse due to high censoring. As a result, the method yields extremely wide confidence regions, or regions that degenerate to zero, as in the case of Model 2 (see Figure 8). Therefore, the time grid in this section is restricted to the interval where sufficient information is available.
For this section, we consider the problem of obtaining the bootstrap confidence region for the probability of default in a time grid $0 = t_1 < t_2 < \cdots < t_{n_T}$ with $t_k \in I_T \subset \mathbb{R}^+$ for all $k = 1, \ldots, n_T$ and $t_{n_T} + b = F^{-1}(0.70|x)$, with $b$ being approximately equal to 20% of the grid length. For Model 1, having set the value of the covariate, $x = 0.6$, the time horizon is $b = 0.1$ (20% of the time range) and $t_{n_T} + b = F^{-1}(0.70|x = 0.6) = 0.55$. For Model 2, having set the value of the covariate, $x = 0.8$, the time horizon is $b = 0.3$ (20% of the time range) and $t_{n_T} + b = F^{-1}(0.70|x = 0.8) = 1.55$. Table 4 contains the bandwidths that minimize the MISE function for Beran’s estimator and the smoothed Beran’s estimator along this new time grid.
For each model, the confidence region is obtained according to the method explained in Section 4 using both Beran’s estimator and the smoothed Beran’s estimator. The criteria for comparing the methods are set out below.
A confidence region performs well if its coverage is close to the nominal one, in this case $1 - \alpha = 0.95$, and it has a small area or average width. Denoting $l(t, x) = \widehat{PD}_h(t|x) - \lambda_\alpha^* \sigma^*(t)$ and $u(t, x) = \widehat{PD}_h(t|x) + \lambda_\alpha^* \sigma^*(t)$ when using Beran’s estimator, or $l(t, x) = \widetilde{PD}_{h,g}(t|x) - \lambda_\alpha^* \sigma^*(t)$ and $u(t, x) = \widetilde{PD}_{h,g}(t|x) + \lambda_\alpha^* \sigma^*(t)$ when using the smoothed Beran’s estimator, the following values measure the performance of the confidence region and allow for the comparison of results. In each of them, the $j$-th term of the outer sum uses the bounds obtained from the $j$-th simulated sample.
Coverage is the proportion of bootstrap regions that contain the whole theoretical probability of default curve, and it is defined as follows:
$$\frac{1}{N}\sum_{j=1}^N I\left(PD(t_k|x) \in \left(l(t_k, x),\ u(t_k, x)\right),\ \forall k = 1, \ldots, n_T\right).$$
The mean pointwise coverage is the mean of the proportion of time grid values for which the confidence region contains the theoretical probability of default curve. It is given by
$$\frac{1}{N}\sum_{j=1}^N \frac{1}{n_T}\sum_{k=1}^{n_T} I\left(PD(t_k|x) \in \left(l(t_k, x),\ u(t_k, x)\right)\right).$$
The average width of the bootstrap confidence region is defined by
$$\frac{1}{N}\sum_{j=1}^N \frac{1}{n_T}\sum_{k=1}^{n_T} \left(u(t_k, x) - l(t_k, x)\right).$$
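The three criteria above can be computed together once the bounds of the N simulated regions are stored on the time grid. The following sketch uses an illustrative function name, `region_metrics`, and treats the intervals as open, which is an assumption consistent with the strict inequality used to define the regions in Section 4.

```python
def region_metrics(truth, lowers, uppers):
    """Coverage, mean pointwise coverage and average width.

    truth:  list of n_T theoretical PD values on the time grid.
    lowers: list of N lists with the lower bounds of each simulated region.
    uppers: list of N lists with the upper bounds of each simulated region.
    """
    n_samples = len(lowers)
    n_times = len(truth)
    full_hits = 0
    pointwise = 0.0
    width = 0.0
    for low, up in zip(lowers, uppers):
        hits = [l < p < u for p, l, u in zip(truth, low, up)]
        full_hits += all(hits)            # whole curve inside the region
        pointwise += sum(hits) / n_times  # share of grid points covered
        width += sum(u - l for l, u in zip(low, up)) / n_times
    return full_hits / n_samples, pointwise / n_samples, width / n_samples
```

Coverage is never larger than the mean pointwise coverage, since a region that contains the whole curve also contains it at every grid point.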
The Winkler score (see [30]) is also used to compare the behavior of the methods. For classical confidence or prediction intervals, it is defined as the length of the interval plus a penalty if the theoretical value is outside the interval. Thus, it combines width and coverage. For values that fall within the interval, the Winkler score is simply the length of the interval, so low scores are associated with narrow intervals. When the theoretical value falls outside the interval, the penalty is proportional to how far the value is from the interval. The formula of the Winkler score (WS) as a function of the time and covariate variables is as follows:
$$WS(t, x) = u(t, x) - l(t, x) + \frac{2}{\alpha}\left(l(t, x) - S(t|x)\right) I\left(S(t|x) < l(t, x)\right) + \frac{2}{\alpha}\left(S(t|x) - u(t, x)\right) I\left(S(t|x) > u(t, x)\right).$$
Since we are working with confidence regions for fixed $x \in I$ and $t$ varying over the interval $I_T$, the integrated Winkler score is proposed as a criterion for the comparison of the confidence regions. It is defined by
$$IWS(x) = \int_{I_T} WS(t, x)\, dt,$$
and the lower the value of IWS, the better the performance of the confidence region.
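A short sketch of both scores follows. It assumes that the theoretical curve and the bounds are available on a time grid and approximates the integral with the trapezoidal rule; the choice of quadrature and the function names are assumptions made for this illustration.

```python
def winkler_score(truth, lower, upper, alpha=0.05):
    """Winkler score of one interval (lower, upper) for a theoretical value."""
    score = upper - lower
    if truth < lower:
        score += (2.0 / alpha) * (lower - truth)
    elif truth > upper:
        score += (2.0 / alpha) * (truth - upper)
    return score


def integrated_winkler_score(grid, truth, lower, upper, alpha=0.05):
    """Trapezoidal approximation of the integral of WS(t, x) over the grid."""
    ws = [winkler_score(p, l, u, alpha) for p, l, u in zip(truth, lower, upper)]
    return sum(
        0.5 * (ws[k] + ws[k + 1]) * (grid[k + 1] - grid[k])
        for k in range(len(grid) - 1)
    )
```

With $\alpha = 0.05$ the penalty factor is $2/\alpha = 40$, so a miss of 0.01 outside the band costs as much as 0.4 of extra width.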
The results obtained are shown in Table 5. The high values of pointwise coverage in all scenarios are remarkable. Furthermore, these coverage percentages are preserved when using double smoothing, while the average width of the confidence regions is halved. This is reflected in the IWS, which presents much larger values in the Beran’s estimator-based confidence regions.
This analysis is also illustrated in Figure 9 and Figure 10, where the confidence region for the probability of default of one sample from Models 1 and 2 is shown. These graphs show the higher variability of the Beran’s estimations in the resamples with respect to the smoothed Beran’s estimations. This leads to much wider confidence regions, especially at the right tail of the time distribution.

6. Application to Real Data

In this section, bandwidth selectors for Beran’s and the smoothed Beran’s estimators are applied to the German Credit dataset, and the confidence region of the probability of default is obtained. This dataset is publicly available on the webpage http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) (accessed on 15 September 2021) and was previously analyzed in [31]. It includes information about 1000 credits, of which 293 were classified as bad credits and 707 as good credits. Thus, the censoring ratio of this dataset is 70.7%. The duration of the credits in months is available along with the credit amount, checking account, savings amount and time of employment, among others.
The duration of the credits is set as the time to default, $Z$; the bad (defaulted) credits are denoted by $\delta = 1$ and the good credits by $\delta = 0$. Let us denote the credit scoring by $X = (1, \theta_2, \theta_3, \theta_4)(X_1, X_2, X_3, X_4)^{\top}$.
Since some of the original covariates are ordinal (interval) variables, we change them into numerical variables by following the criteria explained in [31]: $X_1$ is already a continuous variable denoting the amount of credit in DM, $X_2 \in \{0.05, 0.01, 0.25, 0\}$ denotes the amount of money in the checking account in thousands of DM, $X_3 \in \{0, 0.05, 0.25, 0.75, 1.25\}$ denotes the savings amount in thousands of DM, and $X_4 \in \{0, 0.5, 2.5, 5.5, 8.5\}$ denotes the years of employment. The single-index method proposed by [31] is used to estimate $(1, \theta_2, \theta_3, \theta_4)$. The credit scoring is obtained as follows:
$$X = X_1 + 3.2091\, X_2 + 0.2312\, X_3 + 2.1891\, X_4.$$
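For illustration, the scoring formula is a plain linear combination of the four covariates. The sketch below assumes that the inputs have already been coded and scaled as in [31]; the function name `credit_score` and the example inputs are hypothetical.

```python
# Index coefficients (1, theta_2, theta_3, theta_4) from the formula above.
THETA = (1.0, 3.2091, 0.2312, 2.1891)


def credit_score(x1, x2, x3, x4):
    """Single-index credit scoring for one client."""
    return sum(coef * value for coef, value in zip(THETA, (x1, x2, x3, x4)))


# Hypothetical client, with covariates already on the scale used in [31].
example_score = credit_score(0.5, 0.0, 0.25, 0.5)
```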
Figure 11 shows the scatter plot between credit scoring and follow-up time variable by distinguishing between the censored and uncensored (and therefore, defaulted) credits. A dependency relationship between the two variables can be identified in the plot.
The probability of default, $PD(t|x)$, is estimated when $x = 0.85$, which is close to the sample mean of the credit scoring, and $t \in [0, 60]$. The bandwidth selector presented in Section 2.1 is used to approximate the optimal bandwidth for Beran’s estimator, obtaining $h^* = 0.500$. The bandwidth selector presented in Section 2.2 gives the bootstrap approximation of the optimal bivariate bandwidth for the smoothed Beran’s estimator, $(h^*, g^*) = (0.102, 13.614)$. The estimations of the conditional survival function and the probability of default by means of Beran’s and the smoothed Beran’s estimators with the corresponding bootstrap bandwidths are shown in Figure 12. The poor behavior of Beran’s PD estimator for large values of time is evident. The results obtained by the doubly smoothed Beran’s estimator seem to be more appropriate, since the roughness of Beran’s estimation is not expected in this type of curve. Supporting the conclusion of our real data analysis, the estimations of the probability of default over time for the assessment of risk in portfolios and bond rating obtained in [32,33] have shapes similar to those obtained here.
Finally, the confidence region methods proposed in Section 4 are applied. Since the MISE bandwidths are unknown in this context, bootstrap bandwidths are used. The bootstrap resamples and the resulting confidence regions at confidence level 95% using each estimator are shown in Figure 13. The average width of the confidence region based on Beran’s estimator is 0.5581, and the average width of the one based on the smoothed Beran’s estimator is 0.1438. Note that the confidence region of $PD(t|x)$ is computed over the time interval $[0, 40]$. Since the information provided in the right tail of the time distribution is sparse, Beran’s estimator performs very poorly, leading to extremely wide confidence regions. However, this problem is not as severe for the doubly smoothed Beran’s estimator, so the confidence region is computable for higher values of time. Figure 14 shows the confidence region of $PD(t|x)$ based on the smoothed Beran’s estimator with $t \in [0, 60]$. The average width of this confidence region is 0.2398.
In practice, a financial institution measures different features of its clients, such as age, amount of money in the bank account, salary, years of employment, etc. These covariates are summarized, usually by logistic regression, into a single variable, the credit scoring. Subsequently, techniques such as those shown in this paper allow the calculation of the probability of default at horizon $b$ for each client. For a client with credit scoring $x$ who has paid up to time $t$, the curve $PD(t|x)$ gives the probability of defaulting no later than time $t + b$.

7. Conclusions and Future Lines

This article proposes an automatic bandwidth selector and confidence region algorithms based on bootstrap selectors for Beran’s estimator and the smoothed Beran’s estimator of the probability of default in credit risk. The proposed resampling methods and bootstrap selectors make it possible to approximate the MISE bandwidths corresponding to each estimator: the covariate-smoothing bandwidth in the case of Beran’s estimator and the two-dimensional covariate and time-smoothing bandwidth in the case of the doubly smoothed estimator. In view of the simulation study carried out, it can be concluded that the bandwidth selectors work properly. The doubly smoothed Beran’s estimator with bootstrap bandwidths commits smaller estimation errors than Beran’s estimator. The simulation results also show the good behavior of the confidence regions, especially those based on the doubly smoothed Beran’s estimator. They have a lower average width, reducing the uncertainty about where the true probability of default curve is located, while preserving a high coverage.
The main limitation of the proposed methods is their high computational cost. Approximating the bootstrap bandwidth or computing a confidence region with 500 resamples from a sample of size 100 requires one minute, while a sample of size 500 requires 25 min. These times are similar for both estimators. They seem to increase quadratically as the sample size grows, which may lead to prohibitive times for very large sample sizes. Using subsampling techniques is an appealing idea to be considered in the future for optimizing these methods.
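As a rough check of the quadratic growth suggested above, the two reported timings can be used to estimate the exponent $k$ in an assumed power law $T(n) \approx c\, n^{k}$ (the power-law form is an assumption made here for illustration, not a model fitted in the paper):
$$k \approx \frac{\log\left(T(500)/T(100)\right)}{\log(500/100)} = \frac{\log 25}{\log 5} = 2.$$
With only two timings, this is an indication of the growth rate and not an estimate of the true complexity.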
In a financial context, credit scoring typically summarizes several interesting features of clients in order to measure their creditworthiness. However, this work could be extended to the case of a multidimensional covariate $(X_1, \ldots, X_q)$, where each $X_i$ is a feature of the individual. Methods such as single-index models can be useful for this purpose to avoid the curse of dimensionality. An approach along lines similar to [31] can be used.

Author Contributions

Conceptualization, R.C. and J.M.V.; Data curation, R.P.; Formal analysis, R.P.; Investigation, R.P.; Supervision, R.C. and J.M.V.; Visualization, R.P.; Writing—original draft, R.P.; Writing—review & editing, R.P., R.C. and J.M.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been supported by MICINN Grant PID2020-113578RB-100, and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14 and Centro Singular de Investigación de Galicia ED431G 2019/01), all of them through the ERDF.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wiginton, J.C. A note on the comparison of logit and discriminant models of consumer credit behaviour. J. Financ. Quant. Anal. 1980, 15, 757–770.
  2. Srinivasan, V.; Kim, Y.H. Credit granting: A comparative analysis of classification procedures. J. Financ. 1987, 42, 665–681.
  3. Steenackers, A.; Goovaerts, M.J. A credit scoring model for personal loans. Insur. Math. Econ. 1989, 8, 31–34.
  4. Thomas, L.C.; Crook, J.N.; Edelman, D.B. Credit Scoring and Credit Control; Oxford University Press: Oxford, UK, 1992.
  5. Baba, N.; Goko, H. Survival Analysis of Hedge Funds; Bank of Japan, Working Papers Series; Bank of Japan: Tokyo, Japan, 2006.
  6. Samreen, A.; Zaidi, F. Design and development of credit scoring model for the commercial banks of Pakistan: Forecasting creditworthiness of individual borrowers. Int. J. Bus. Soc. Sci. 2012, 17, 155–166.
  7. Narain, B. Survival analysis and the credit granting decision. In Credit Scoring and Credit Control; Thomas, L.C., Crook, J.N., Edelman, D.B., Eds.; Oxford University Press: Oxford, UK, 1992; pp. 109–121.
  8. Schuermann, T.; Hanson, S.G. Estimating probabilities of default. In Staff Report Federal Reserve Bank of New York; Federal Reserve Bank of New York: New York, NY, USA, 2004; pp. 923–947.
  9. Glennon, D.; Nigro, P. Measuring the default risk of small business loans: A survival analysis approach. J. Money Credit. Bank. 2005, 37, 923–947.
  10. Allen, L.N.; Rose, L.C. Financial survival analysis of defaulted debtors. J. Oper. Res. Soc. 2006, 57, 630–636.
  11. Beran, J.; Djaïdja, A.K. Credit risk modeling based on survival analysis with immunes. Stat. Methodol. 2007, 4, 251–276.
  12. Cao, R.; Vilar, J.M.; Devia, A. Modelling consumer credit risk via survival analysis (with discussion). Stat. Oper. Res. Trans. 2009, 33, 3–30.
  13. Peláez, R.; Cao, R.; Vilar, J.M. Probability of default estimation in credit risk using a nonparametric approach. TEST 2021, 30, 383–405.
  14. Peláez, R.; Cao, R.; Vilar, J.M. Nonparametric estimation of probability of default with double smoothing. SORT 2021, 45, 93–120.
  15. Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1–26.
  16. Efron, B. Censored data and the bootstrap. J. Am. Stat. Assoc. 1981, 76, 312–319.
  17. Efron, B.; Tibshirani, R. An Introduction to the Bootstrap; Chapman and Hall: London, UK, 1993.
  18. Akritas, M. Bootstrapping the Kaplan-Meier estimator. J. Am. Stat. Assoc. 1986, 81, 1032–1039.
  19. Lo, S.H.; Singh, K. The product-limit estimator and the bootstrap: Some asymptotic representations. Probab. Theory Relat. Fields 1986, 71, 455–465.
  20. Van Keilegom, I.; Veraverbeke, N. Estimation and bootstrap with censored data in fixed design nonparametric regression. Ann. Inst. Stat. Math. 1997, 49, 467–491.
  21. Li, G.; Datta, S. A bootstrap approach to nonparametric regression for right censored data. Ann. Inst. Stat. Math. 2001, 53, 708–729.
  22. Geerdens, C.; Acar, E.F.; Janssen, P. Conditional copula models for right-censored clustered event time data. Biostatistics 2017, 19, 247–262.
  23. Földes, A.; Rejtø, L.; Winter, B.B. Strong consistency properties of nonparametric estimators for randomly censored data, II: Estimation of density and failure rate. Period. Math. Hung. 1981, 12, 15–29.
  24. Leconte, E.; Poiraud-Casanova, S.; Thomas-Agnan, C. Smooth conditional distribution function and quantiles under random censorship. Lifetime Data Anal. 2002, 8, 229–246.
  25. Peláez, R.; Cao, R.; Vilar, J.M. Nonparametric Estimation of the Conditional Survival Function with Double Smoothing; Technical Report; Universidade da Coruña: A Coruña, Spain, 2021.
  26. Beran, R. Nonparametric Regression with Randomly Censored Survival Data; Technical Report; University of California: Los Angeles, CA, USA, 1981.
  27. Byrd, R.H.; Lu, P.; Nocedal, J.; Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 1995, 16, 1190–1208.
  28. Zhu, C.; Byrd, R.H.; Lu, P.; Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 1997, 23, 550–560.
  29. Cao, R.; Francisco-Fernández, M.; Quinto, E. A random effect multiplicative heteroscedastic model for bacterial growth. BMC Bioinform. 2010, 11, 77.
  30. Winkler, R.L. A decision-theoretic approach to interval estimation. J. Am. Stat. Assoc. 1972, 67, 187–191.
  31. Strzalkowska-Kominiak, E.; Cao, R. Maximum likelihood estimation for conditional distribution single-index models under censoring. J. Multivar. Anal. 2013, 114, 74–98.
  32. Barnard, B. Rating Migration and Bond Valuation: Ahistorical Interest Rate and Default Probability Term Structures; University of the Witwatersrand, Wits Business School: Johannesburg, South Africa, 2017.
  33. dos Reis, G.; Smith, G. Robust and consistent estimation of generators in credit risk. Quant. Financ. 2018, 18, 983–1001.
Figure 1. $MISE_x(h)$ function (black line) approximated via Monte Carlo and $MISE_x^*(h)$ functions (gray lines) for $N = 300$ samples (top), boxplot of $H_1^*, \ldots, H_N^*$ values (middle) and boxplot of $R_1^*, \ldots, R_N^*$ values (bottom) when the conditional probability of censoring is 0.2 (left) and 0.5 (right) in Model 1.
Figure 2. $MISE_x(h)$ function (black line) approximated via Monte Carlo and $MISE_x^*(h)$ functions (gray lines) for $N = 300$ samples (top), boxplot of $H_1^*, \ldots, H_N^*$ values (middle) and boxplot of $R_1^*, \ldots, R_N^*$ values (bottom) when the conditional probability of censoring is 0.2 (left) and 0.5 (right) in Model 2.
Figure 3. Theoretical probability of default function $PD(t|x)$ (solid line), Beran’s estimation with MISE bandwidth (dotted line) and Beran’s estimation with bootstrap bandwidth (dashed line) for one sample from Model 1 (left) and Model 2 (right) with $P(\delta = 0|x) = 0.5$.
Figure 4. $MISE_x(h, g)$ function approximated via Monte Carlo (left) and $MISE_x^*(h, g)$ function approximated via bootstrap (right) for one sample from Model 1 (top) and Model 2 (bottom) when $P(\delta = 0|x) = 0.5$.
Figure 5. Boxplot of $H_1^*, \ldots, H_N^*$ values (top) and boxplot of $R_1^*, \ldots, R_N^*$ values (bottom) when the conditional probability of censoring is 0.2 (left) and 0.5 (right) in Model 1.
Figure 6. Boxplot of $H_1^*, \ldots, H_N^*$ values (top) and boxplot of $R_1^*, \ldots, R_N^*$ values (bottom) when the conditional probability of censoring is 0.2 (left) and 0.5 (right) in Model 2.
Figure 7. Theoretical probability of default function $PD(t|x)$ (solid line), smoothed Beran’s estimation with MISE bandwidth (dotted line) and smoothed Beran’s estimation with bootstrap bandwidth (dashed line) for one sample from Model 1 (left) and Model 2 (right) with $P(\delta = 0|x) = 0.5$.
Figure 8. Theoretical $PD(t|x)$ (red solid line), Beran’s estimation of $PD(t|x)$ with MISE bandwidths (black dashed line) and bootstrap versions of Beran’s estimations of $PD(t|x)$ from $B = 500$ resamples (gray dashed lines) of one sample from Model 2 when $P(\delta = 0|x) = 0.5$.
Figure 9. Theoretical $PD(t|x)$ (red solid line) and estimation with MISE bandwidths (black dashed line) along with the bootstrap estimations of $PD(t|x)$ from $B = 500$ resamples (gray dashed lines) in the left panel and 95% confidence region (black dotted lines) in the right panel by means of Beran’s estimator (top) and the smoothed Beran’s estimator (bottom) for one sample from Model 1 when $P(\delta = 0|x) = 0.5$.
Figure 10. Theoretical $PD(t|x)$ (red solid line) and estimation with MISE bandwidths (black dashed line) along with the bootstrap estimations of $PD(t|x)$ from $B = 500$ resamples (gray dashed lines) in the left panel and 95% confidence region (black dotted lines) in the right panel by means of Beran’s estimator (top) and the smoothed Beran’s estimator (bottom) for one sample from Model 2 when $P(\delta = 0|x) = 0.5$.
Figure 11. Scatter plot of credit scoring and duration of the credit in the censored group (red circles) and the uncensored group (blue triangles) of the German credit data.
Figure 12. Conditional survival function estimation (left) and probability of default estimation (right) by means of Beran’s estimator (dashed line) and the smoothed Beran’s estimator (solid line) with bootstrap bandwidths when $x = 0.85$ in the German credit dataset.
Figure 13. Estimation of $PD(t|x)$ with bootstrap bandwidths (black line) along with bootstrap estimations of PD (gray lines) from $B = 500$ resamples (left) and 95% confidence region (right) by Beran’s estimator (top) and the smoothed Beran’s estimator (bottom) when $x = 0.85$ and $t \in [0, 40]$ in the German credit dataset.
Figure 14. Estimation of $PD(t|x)$ with bootstrap bandwidths (black line) along with bootstrap estimations of PD (gray lines) from $B = 500$ resamples (left) and 95% confidence region (right) by the smoothed Beran’s estimator when $x = 0.85$ and $t \in [0, 60]$ in the German credit dataset.
Table 1. MISE bandwidths, average bootstrap bandwidths and estimation errors of Beran’s PD estimator for each level of the conditional probability of censoring in Models 1 and 2. Numbers within brackets are standard deviations.

| | Model 1 | Model 1 | Model 2 | Model 2 |
|---|---|---|---|---|
| $P(\delta = 0 \mid X = x)$ | 0.2 | 0.5 | 0.2 | 0.5 |
| $h_{MISE}$ | 0.37576 | 0.35909 | 0.09494 | 0.10959 |
| $RMISE_x(h_{MISE})$ | 0.05520 | 0.11144 | 0.27942 | 0.49991 |
| $\overline{h^*}$ (sd) | 0.27856 (0.092) | 0.30892 (0.110) | 0.21763 (0.041) | 0.23091 (0.068) |
| $\overline{H^*}$ | 0.31431 | 0.29306 | 1.29211 | 1.10692 |
| $\overline{RMISE_x(h^*)}$ | 0.05700 | 0.11405 | 0.29671 | 0.50824 |
| $\overline{R^*}$ | 0.03260 | 0.02336 | 0.06188 | 0.01666 |
Table 2. MISE, average bootstrap bandwidths and estimation errors of the smoothed Beran’s PD estimator in each level of censoring conditional probability for Models 1 and 2. Numbers within brackets are standard deviations.

| | Model 1 | Model 1 | Model 2 | Model 2 |
|---|---|---|---|---|
| $P(\delta = 0 \mid X = x)$ | 0.2 | 0.5 | 0.2 | 0.5 |
| $h_{MISE}$ | 0.21633 | 0.16735 | 0.11122 | 0.37551 |
| $g_{MISE}$ | 0.09286 | 0.14612 | 1.27755 | 1.68878 |
| $RMISE_x(h_{MISE}, g_{MISE})$ | 0.03710 | 0.05094 | 0.09829 | 0.12322 |
| $\overline{h^*}$ (sd) | 0.11736 (0.051) | 0.11219 (0.057) | 0.19813 (0.180) | 0.16593 (0.218) |
| $\overline{g^*}$ (sd) | 0.12647 (0.039) | 0.19671 (0.054) | 0.60005 (0.375) | 1.45428 (0.711) |
| $\overline{H^*}$ | 0.68121 | 0.63604 | 1.22609 | 0.89164 |
| $\overline{RMISE_x(h^*, g^*)}$ | 0.04620 | 0.06793 | 0.22135 | 0.28342 |
| $\overline{R^*}$ | 0.24517 | 0.33357 | 1.25199 | 1.30003 |
Table 3. Comparative table of the estimation error of Beran’s estimator and the smoothed Beran’s estimator in Models 1 and 2.

| Estimator | Quantity | Model 1 | Model 1 | Model 2 | Model 2 |
|---|---|---|---|---|---|
| | $P(\delta = 0 \mid X = x)$ | 0.2 | 0.5 | 0.2 | 0.5 |
| BERAN | $\overline{RMISE_x(h^*)}$ | 0.05579 | 0.11206 | 0.28593 | 0.49916 |
| SBERAN | $\overline{RMISE(h^*, g^*)}$ | 0.04629 | 0.07216 | 0.20007 | 0.27611 |
| | $R_S$ | 0.20523 | 0.55294 | 0.42915 | 0.80783 |
Table 4. MISE bandwidths and RMISE of Beran and smoothed Beran’s estimators in each level of censoring conditional probability for Models 1 and 2 when $t_{n_T} + b = F^{-1}(0.70 \mid x)$.

| Estimator | Quantity | Model 1 | Model 1 | Model 2 | Model 2 |
|---|---|---|---|---|---|
| | $P(\delta = 0 \mid X = x)$ | 0.2 | 0.5 | 0.2 | 0.5 |
| BERAN | $h_{MISE}$ | 0.375510 | 0.320408 | 0.041837 | 0.057755 |
| BERAN | $RMISE_x(h_{MISE})$ | 0.019403 | 0.025943 | 0.193334 | 0.220733 |
| SBERAN | $h_{MISE}$ | 0.230612 | 0.196939 | 0.094286 | 0.154490 |
| SBERAN | $g_{MISE}$ | 0.073673 | 0.083469 | 0.908163 | 1.071429 |
| SBERAN | $RMISE_x(h_{MISE}, g_{MISE})$ | 0.013658 | 0.018165 | 0.026161 | 0.029007 |
Table 5. Coverage, average width and IWS of the 95% confidence regions by means of Beran’s and the smoothed Beran’s estimators using $N = 300$ simulated samples from Models 1 and 2.

Model 1

| $P(\delta = 0 \mid X = x)$ | 0.2 | 0.2 | 0.5 | 0.5 |
|---|---|---|---|---|
| Estimator | BERAN | SBERAN | BERAN | SBERAN |
| Coverage (%) | 96.33 | 90.67 | 90.00 | 85.33 |
| Mean pointwise coverage (%) | 99.94 | 98.05 | 99.63 | 96.85 |
| Mean width | 0.21997 | 0.09539 | 0.24827 | 0.10937 |
| IWS | 0.09869 | 0.04537 | 0.11218 | 0.05571 |

Model 2

| $P(\delta = 0 \mid X = x)$ | 0.2 | 0.2 | 0.5 | 0.5 |
|---|---|---|---|---|
| Estimator | BERAN | SBERAN | BERAN | SBERAN |
| Coverage (%) | 97.33 | 83.00 | 91.46 | 98.00 |
| Mean pointwise coverage (%) | 99.88 | 98.53 | 99.65 | 99.85 |
| Mean width | 0.50514 | 0.17969 | 0.55581 | 0.33033 |
| IWS | 0.62590 | 0.22825 | 0.71009 | 0.40845 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Peláez, R.; Cao, R.; Vilar, J.M. Bootstrap Bandwidth Selection and Confidence Regions for Double Smoothed Default Probability Estimation. Mathematics 2022, 10, 1523. https://doi.org/10.3390/math10091523
