Importance Sampling in the Presence of PD-LGD Correlation

This paper seeks to identify computationally efficient importance sampling (IS) algorithms for estimating large deviation probabilities for the loss on a portfolio of loans. Related literature typically assumes that realised losses on defaulted loans can be predicted with certainty, i.e., that loss given default (LGD) is non-random. In practice, however, LGD is impossible to predict and tends to be positively correlated with the default rate and the latter phenomenon is typically referred to as PD-LGD correlation (here PD refers to probability of default, which is often used synonymously with default rate). There is a large literature on modelling stochastic LGD and PD-LGD correlation, but there is a dearth of literature on using importance sampling to estimate large deviation probabilities in those models. Numerical evidence indicates that the proposed algorithms are extremely effective at reducing the computational burden associated with obtaining accurate estimates of large deviation probabilities across a wide variety of PD-LGD correlation models that have been proposed in the literature.


Introduction
This paper seeks to identify computationally efficient importance sampling (IS) algorithms for estimating large deviation probabilities for the loss on a portfolio of loans. Related literature assumes that realised losses on defaulted loans can be predicted with certainty, i.e., that loss given default (LGD) is non-random. In practice, however, LGD is impossible to predict and tends to be positively correlated with the default rate and the latter phenomenon is typically referred to as PD-LGD correlation (here PD refers to probability of default, which is often used synonymously with default rate). There is a large literature on modelling stochastic LGD and PD-LGD correlation, but there is a paucity of literature on using importance sampling to estimate large deviation probabilities in those models. This gap in the literature was brought to our attention by a risk management professional at a large Canadian financial institution, and filling that gap is the ultimate goal of this paper.

Problem Formulation and Related Literature
Consider a portfolio of N exposures of equal size. Let L 1 , L 2 , . . . , L N denote the losses on the individual loans, expressed as a percentage of notional value. The percentage loss on the entire portfolio is:L We are interested in using IS to estimate large deviation probabilities of the form: where x >> E[L i ] = E[L N ] is some large, user-defined, threshold.
In practice the number of exposures is large (e.g., in the thousands) and prudent risk management requires one to assume that the individual losses are correlated. In practice, then,L N is the average of a large number of correlated variables. As such, its probability distribution is highly intractable and Monte Carlo is the method of choice for approximating p x . As the probability of interest is typically small (e.g., on the order of 10 −3 or 10 −4 ), the computational burden required to obtain an accurate estimate of p x using Monte Carlo can be prohibitive. For instance if p x is on the order of 10 −3 and N is on the order of 1000 then, in the absence of any variance reduction techniques, the sample size required to reduce the estimator's relative error 1 to 10% is on the order of one hundred thousand. Since each realisation ofL N requires simulation of one thousand individual losses, a sample size of 100, 000 requires one to generate one hundred million variables. If the desired degree of accuracy is reduced to 1%, the number of variables that must be generated increases to a staggering 10 billion.
Importance sampling (IS) is a variance reduction technique that has the potential to significantly reduce the computational burden associated with obtaining accurate estimates of large deviation probabilities. In the present context, effective IS algorithms have been identified for a variety of popular risk management models, but most are limited to the special case that loss given default (LGD) is non-random. The seminal paper in the area is (Glasserman and Li 2005), other papers include (Chan and Kroese 2010) and (Scott and Metzler 2015). It is well documented empirically, however, that portfolio-level LGD is not only stochastic, but positively correlated with the portfolio-level default rate as seen, for instance, in any of the studies listed in (Kupiec 2008) or (Frye and Jacobs 2012). This phenomenon is typically referred to as PD-LGD correlation. (Miu and Ozdemir 2006) show that ignoring PD-LGD correlation when it is in fact present can lead to material underestimates of portfolio risk measures.
There is a large literature on modelling PD-LGD correlation (Frye 2000); (Pykhtin 2003); (Miu and Ozdemir 2006); (Kupiec 2008); (Sen 2008); (Witzany 2011);(de Wit 2016); (Eckert et al. 2016); and others listed in (Frye and Jacobs 2012), but there is a much smaller literature on using IS to estimate large deviation probabilities in such models. To the best of our knowledge only (Deng et al. 2012) and (Jeon et al. 2017) have developed algorithms that allow for PD-LGD correlation (the former paper considers a dynamic intensity-based framework, the latter considers a static model with asymmetric and heavy-tailed risk factors). The present paper contributes to this nascent literature by developing algorithms that can be applied in a wide variety of PD-LGD correlation models that have been proposed in the literature, and are popular in practice.
The paper is structured as follows. Section 2 outlines important assumptions, notation, and terminology. Section 3 theoretically motivates the proposed algorithm in a general setting, and Section 4 discusses a few practical issues that arise when implementing the algorithm. Section 5 describes a general framework for PD-LGD correlation modelling that includes, as special cases, many of the models that have been developed in the literature and Section 6 describes how to implement the proposed algorithm in this general framework. Numerical results are presented and discussed in Section 7 and demonstrate that the proposed algorithms are extremely effective at reducing the computational burden required to obtain an accurate estimate of p x . 1 Relative error is the preferred measure of accuracy for large deviation probabilities. If p x is an estimator of p x , its relative error is defined as SD( p x )/p x , where SD denotes standard deviation.

Assumptions, Notation and Terminology
We assume that individual losses are of the form L i = L(Z, Y i ), where L is some deterministic function, Z = (Z 1 , . . . , Z d ) is a d-dimensional vector of systematic risk factors that affect all exposures, and Y i is a vector of idiosyncratic risk factors that only affect exposure i. We assume that Z, Y 1 , Y 2 , . . . are independent, and that the Y i are identically distributed. The primary role of the systematic risk factors is to induce correlation among the individual exposures, and it is common to interpret the realised values of the systematic risk factors as determining the overall macroeconomic environment. It is worth noting that the we do not require the components of Z to be independent of one another, etc. for the components of Y i .

Large Portfolios and the Region of Interest
In a large portfolio, the influence of the idiosyncratic risk factors is negligible. Indeed, since individual losses are conditionally independent, given the realised values of the systematic risk factors, we have the almost sure limit: lim where Since µ(Z) ≈L N for large N by Equation (3), the random variable µ(Z) is often called the large portfolio approximation (LPA) toL N . The LPA is often used to formalise the intuitive notion that, in a large portfolio, all risk is systematic (i.e., idiosyncratic is "diversified away"). We define the region of interest as the set: The region of interest is "responsible" for large deviations in the sense that: for most values 2 of x. Together, Equations (3) and (6) suggest that for large portfolios, it is relatively more important to identify an effective IS distribution for the systematic risk factors, as compared to the idiosyncratic risk factors.

Systematic Risk Factors
We assume that Z is continuous and let f (z) denote its joint density. We assume that f is a member of an exponential family (see Bickel and Doksum 2001 for definitions and important properties) with natural sufficient statistic S : R d → R p . Any other member of the family can be put in the form: where K(·) is the cumulant generating function (cgf) of S(Z) and λ ∈ R p is such that K(λ) is well-defined. The parameter λ is called the natural parameter of the family in Equation (7). Appendix B embeds the Gaussian and multivariate t families into this general framework. 2 In light of the almost sure limit in Equation (3), we have thatL N converges to µ(Z) in distribution, which implies that Equation (6) is valid for all values of x such that P(µ(Z) = x) = 0. If µ(Z) is a continuous random variable (which it is in most cases of practical interest) then Equation (6) is satisfied for every value of x.
We will eventually be using densities of the form in Equation (7) as IS densities for the systematic risk factors. The associated IS weight is: and it will be important to know when the variance of the IS weight is finite. The following observation is readily verified.

Remark 1.
If Z ∼ f λ , then Equation (8) has finite variance if and only if both K(λ) and K(−λ) are well defined.
A standard result in the theory of exponential families is that: where ∇ denotes gradient and E λ denotes expectation with respect to the density f λ .

Individual Losses
We assume that L i takes values in the unit interval. In general L i will have a point mass at zero (if it did not, the loan would not be prudent) and the conditional distribution of L i , given that L i > 0, is called the (account-level) LGD distribution. We allow the LGD distribution to be arbitrary in the sense that it could be either discrete or continuous, or a mixture of both. This contrasts with the case of non-random LGD, where the LGD distribution is degenerate at a single point. We let max ∈ (0, 1] denote the supremum of the support of L i . Individual losses will therefore never exceed max but could take on values arbitrarily close (and possibly equal) to max .
Remark 2. Despite the fact that L i is not a continuous variable, in what follows we will proceed as if it was and make repeated reference to its "density." This is done without loss of generality, and in an interest of simplifying the presentation and discussion. Nothing in the sequel requires L i to be a continuous variable, and everything carries over to the case where it is either discrete or continuous, or has both a discrete and continuous component.
For z ∈ R d we let g( |z) denote the conditional density of L i , given that Z = z. We assume that the support of g(·|z) is identical to the unconditional support, in particular it does not depend on the value of z. Note that µ(z) is the mean of g(·|z).
In practice (i.e., for all of the PD-LGD correlation models listed in the introduction) g(·|z) is not a member of an established parametric family, and direct simulation from g(·|z) using a standard technique such as inverse transform or rejection sampling is not straightforward. Simulation from g(·|z) is most easily accomplished by simulating the idiosyncratic risk factors, Y i , from their density, say η(y), and then setting L i = L(z, Y i ). In other words, in order to simulate from g(·|z) we make use of the fact that L i = L(z, Y i ) is a drawing from g(·|z) whenever Y i is a drawing from η(·).
Then k(·, z) is the conditional cgf of L i , given that Z = z, and k (·, z) is its first derivative. In practice, neither k(·, z) nor k (·, z) is available in closed form.
In the examples we consider later in the paper each can be expressed as a one-dimensional integral, but the numerical values of those integrals must be approximated using quadrature. This contrasts with the case of non-random LGD, where the conditional cgf can be computed in closed form 3 . For x ∈ (0, max ) and z ∈ R d we letθ(x, z) denote the unique solution to the equation k (θ, z) = max(x, µ(z)). We often suppress dependence on x and z, and simply writeθ instead ofθ(x, z). Thatθ is well-defined follows immediately from the developments in Appendix A.1. Based on the discussion there we find thatθ is zero whenever z lies in the region of interest, and is strictly positive otherwise.
Remark 3. In practice, the value ofθ cannot be computed in closed form and must be approximated using a numerical root-finding algorithm. Since each evaluation of the function k (·, z) requires quadrature, computinĝ θ is straightforward but relatively time consuming. This contrasts with the case of non-random LGD, whereθ can be computed in closed form at essentially no cost.
For z ∈ R d we let q(·, z) denote the Legendre transform of k(·, z) over [0, ∞). That is, Thatθ is the uniquely defined point at which the function θ → θx − k(θ, z) attains its maximum on [0, ∞) follows from the developments in Appendix A.2. Based on the discussion there, we find that bothθ and q are equal to zero whenever z lies in the region of interest, and that both are strictly positive otherwise.

Conditional Tail Probabilities
Given the realised values of the systematic risk factors, individual losses are independent. Large deviations theory can therefore provide useful insights into the large-N behaviour of the tail probability P(L N ≥ x|Z = z). For instance, Chernoff's bound yields the estimate: and Cramér's (large deviation) theorem yields the limit: Together these results are often used to justify the approximation: which will be used repeatedly throughout the paper. The approximation in Equation (13) is often called the large deviation approximation (LDA) to the tail probability P(L N > x|Z = z). Note that since q(x, z) = 0 whenever µ(z) ≥ x, the LDA suggests that P(L N > x|Z = z) ≈ 1 whenever z lies in the region of interest.
where A N,x is the set of points ∈ [0, max ] N for which N −1 ∑ N i=1 i > x. We let f x (z) denote the conditional density of the systematic risk factors, given thatL N > x, noting that: In the examples we consider the mean of f x tends to lie inside, but close to the boundary of, the region of interest. And relative to the unconditional density f , the conditional density f x tends to be much more concentrated about its mean. Finally, we let g x ( |z) denote the conditional density of an individual loss, given that Z = z and L N > x, noting that: If the realised value of z lies inside the region of interest, the conditional density g x (·|z) tends to resemble the unconditional density g(·|z). Intuitively, for such values of z the LDA informs that the event {L N > x} is very likely, and conditioning on its occurrence is not overly informative. If the realised value of z does not lie in the region of interest then g x (·|z) tends to resemble the exponentially tilted version of g(·|z) whose mean is exactly x. See Appendix A.3 for more details.
Neither h x , f x , nor g x are numerically tractable, but as we will soon see they do serve as useful benchmarks against which to compare candidate IS densities. In addition, it is worth noting here that the representations of Equations (15) and (16) lend themselves to numerical approximation via the LDA in Equation (13).

Proposed Algorithm
In practice, the most common approach to estimating p x via Monte Carlo simulation in this framework is summarised in Algorithm 1 below.
Algorithm 1 consists of two stages. In the first stage one simulates the systematic risk factors, and in the second stage one simulates the idiosyncratic risk factors for each exposure. Mathematically, the first stage induces independence among the individual exposures, so that the second stage amounts to simulating a large number of i.i.d. variables. Intuitively, it is useful to think of the first stage as determining the prevailing macroeconomic environment, which fixes economy-wide quantities such as default and loss-given-default rates. The second stage of the algorithm overlays idiosyncratic noise on top of economy-wide rates, to arrive at the default and loss-given-default rates for a particular portfolio.
Relative error is the preferred measure of accuracy for estimators of rare event probabilities. The relative error of the estimator p x in Algorithm 1 is: and the sample size required to ensure the relative error does not exceed some predetermined threshold is: The number of variables that must be generated in order to achieve the desired degree of accuracy is therefore (N + d) · M( ), which grows without bound as p x → 0. For instance if p x = 10 −3 , N = 10 3 , d = 2, and = 5 · 10 −2 then the number of variables that must be generated is approximately four hundred million, which is an enormous computational burden for a modest degree of accuracy.
In the next section we discuss general principles for selecting an IS algorithm that can reduce the computational burden required to obtain an accurate estimate of p x .

General Principles
For practical reasons, we insist that our IS procedure retains conditional independence of individual losses, given the realised value of the systematic risk factors. This is important because it allows us to reduce the problem of simulating a large number of dependent variables to the (much) more computationally efficient problem of simulating a large number of independent variables.
In the first stage we simulate the systematic risk factors from the IS density f IS (z). The IS weight associated with this first stage is therefore: In the second stage we simulate the individual losses as i.i.d. drawings from the density g IS ( |z). The IS weight associated with this second stage is: and the IS density from which we sample (Z, L) is therefore of the form: The so-described algorithm, with as-yet unspecified IS densities, is summarised in Algorithm 2.
Algorithm 2 IS Algorithm for Estimating p x 1: Simulate M i.i.d. copies of the systematic risk factors from the density f IS (z). Think of these as different economic scenarios and denote the simulated values by z 1 , . . . , z M . 2: For each scenario m: (a) Independently simulate 1,m , 2,m , . . . , N,m from the density g IS (·|z m ).
It is important to note that in the second stage, we will not be simulating individual losses directly from the (conditional) IS density g IS . Rather, we will simulate the idiosyncratic risk factors Y i in such a way as to ensure that for a given value of z, the variable L i = L(z, Y i ) has the desired density g IS .
Focusing on the "indirect" IS density of L i , as opposed to "direct" IS density of Y i , allows us to identify a much more effective second stage algorithm 4 .
The estimator p x produced by Algorithm 2 is demonstrably unbiased and its variance is: where E IS denotes expectation under the IS distribution, Λ(z, ) := Λ 1 (z) · Λ 2 (z, ) and Note that Λ x is the ratio of (i) the IS density in Equation (18) to (ii) the conditional density in Equation (14). The estimator's squared relative error can then be decomposed as: where P IS denotes probability under the IS distribution. Inspecting Equation (20) we see that an effective IS density should (i) assign a high probability to the event of interest and (ii) should resemble the conditional density in Equation (14) as closely as possible, in the sense that the ratio Λ x should deviate as little as possible from unity. Clearly, an estimator that satisfies (ii) should also satisfy (i), since h x assigns probability one to the event that L N > x. The task now is to identify a density of the form in Equation (18) that resembles the ideal density in Equation (14), in some sense.

Identifying the Ideal IS Densities
Our measure of similarity is Kullback-Leibler divergence (KLD), or divergence for short. See Chatterjee and Diaconis (2018) for a general discussion of the merits of minimum divergence as a criteria for identifying effective IS distributions. We begin by writing: where for fixed z,g is the joint density of N independent variables having marginal density g(·|z), conditioned on their average value exceeding the threshold x, and is the joint density of N independent variables having marginal density g IS (·|z). Using Equation (21) it is straightforward to decompose the divergence of h IS from h x as: where D(ξ||η) denotes the divergence of the density ξ from the density η. The first term in Equation (22) is the divergence of f IS from f x , and is therefore minimised by setting f IS = f x . In other words, the best 4 In the earliest stages of this project we focused directly on an IS density for Y i and had difficulties identifying effective candidates.
possible IS density for the systematic risk factors (according to the criteria of minimum divergence) is the conditional density f x . The second term in Equation (22) is the average divergence ofg IS (·|z) fromg x (·|z), averaged over all possible realisations of the systematic risk factors and conditioned on portfolio loss exceeding the threshold. Based on the developments in Appendix A.5, for fixed z ∈ R d the divergence ofg IS (·|z) fromg x (·|z) is minimised by setting g IS (·|z) = g x (·|z). The average divergence in Equation (22) is, therefore, also minimised by setting g IS (·|z) = g x (·|z) for every z ∈ R d .
Remark 4. Among all densities of the form in Equation (18), the one that most resembles the ideal density h x (in the sense of minimum divergence) is the density: In other words,ĥ x is the best possible IS density (among the class Equation (18) and according to the criteria of minimum divergence) from which to simulate (Z, L).
It is worth noting that the IS densityĥ x "gets marginal behaviour correct", in the sense that the marginal distribution of the systematic risk factors, as well as the marginal distribution of an individual loss, is the same underĥ x as it is under the ideal density h x . The dependence structure of individual losses is different underĥ x and h x -this is the price that we must pay for insisting on conditional independence (i.e., computational efficiency).

Approximating the Ideal IS Densities
Simulating directly fromĥ x requires an ability to simulate directly from f x and g x . Unfortunately, neither f x nor g x is numerically tractable (witness the unknown quantities in Equations (15) and (16)), and it does not appear that either is amenable to direct simulation. Our next task is to identify tractable densities that resemble f x and g x .

Systematic Risk Factors
As a tractable approximation to f x , we suggest using that member of the parametric family in Equation (7) that most resembles f x in the sense of minimum divergence. Using Equations (7) and (15) we get that: whence the divergence of f λ from f x is: As a cgf, K(·) is strictly convex. As such, Equation (23) attains its unique minimum at that value of λ such that: which, in light of Equation (9), is equivalent to: Intuitively, we suggest using that value of the IS parameter λ for which the mean of S(Z) under the IS density matches the conditional mean of S(Z), given that portfolio losses exceed the threshold. In what follows we letλ x denote that suggested value of the IS parameter λ, i.e., that value of λ that solves Equation (24).
Remark 5. The first-stage IS weight associated with the so-described density is: It is entirely possible-and quite common in the examples we consider in this paper-that K(−λ x ) is not well-defined, in which case Equation (26) has infinite variance under fλ x (recall Remark 1). At first glance it might seem absurd to consider IS densities whose associated weights have infinite variance, but as we discuss in Section 4.2 it is straightforward to circumvent this issue by trimming large first-stage IS weights 5 .
It remains to develop a tractable approximation to the right hand side of Equation (24), so that we can approximate the value ofλ x . To this end we write the natural sufficient statistic as S(z) = (S 1 (z), . . . , S p (z)) and note that: Next, we use the LDA in Equation (13) to get: As it only involves the systematic risk factors (and not the large number of idiosyncratic risk factors), the expectation on the right hand side of Equation (27) is amenable to either quadrature or Monte Carlo simulation.

Individual Losses
We encourage the reader unfamiliar with exponential tilts to consult Appendix A.3, before reading the remainder of this section. Our approximation to g x ( |z) is obtained by using the LDA of Equation (13) to approximate both conditional probabilities appearing in Equation (16) (see Appendix A.4 for details). The resulting approximation is: where we recall thatθ is defined and discussed in Section 2.3. If the realised values of the systematic risk factors obtained in the first stage lie in the region of interest thenθ = 0 andĝ x is identical to g. Otherwise,θ is strictly positive andĝ x is the exponentially tilted version of g whose mean is x. Intuitively, we can interpretĝ x as that density that most resembles (in the sense of minimum divergence) g x , among all densities whose mean is at least x, and the numerical value ofθ as the degree to which the density g(·|z) must be deformed, in order to produce a density whose mean is at least x. (28) is max(µ(z), x). The implication is that the event of interest is not a rare event under the proposed IS algorithm. Indeed,

Remark 6. The mean of Equation
The second-stage IS weight associated with Equation (28) is: Since the second stage weight depends only on Z andL N , we will often write Λ 2 (Z,L N ) instead of Λ 2 (Z, L). In order to assess the stability of the second-stage IS weight, we note that: If Z lies in the region of interest thenθ = q = 0, whence Λ 2 (Z,L N ) = 1 whatever the value ofL N . Otherwise, bothθ and q are strictly positive, which implies that Λ 2 (Z,L N ) < 1 wheneverL N > x. The net result of this discussion is that: The implication is that large, unstable, IS weights in the second stage will never be a problem.
If the realised value of z does lie in the region of interest thenĝ x and g are identical, and simulation from g is straightforward. Our final task is to determine how to sample from Equation (28) in the case where z does not lie in the region of interest. One approach would be to identify a family of densities but this approach appears to be overly complicated. A simpler approach is to sample from Equation (28) using rejection sampling with g as the proposal density. To this end, we note that for fixed z, the ratio ofĝ x to g is exp(θ − k(θ, z)), which is bounded and strictly increasing on [0, max ]. The best possible (i.e., smallest) rejection constant is therefore: and the algorithm for sampling fromĝ x would proceed as follows. First, sample Y i from its actual density and setL i = L(z, Y i ). Then generate a random number U, uniformly distributed on [0, 1] and independent of Y i . If, set L i =L i and proceed to the next exposure. Otherwise return to the first step and sample another pair (Y i , U).

Summary and Intuition
The proposed algorithm is summarised in Algorithm 3 below. The initial step is to approximate the value of the first-stage IS parameter,λ x . In our numerical examples we use a small pilot simulation (10% of the sample size that we eventually use to estimate p x ) and the approximation of Equation (27) in order to estimateλ x .
Having computedλ x , the first stage of the algorithm proceeds by simulating independent realisations of the systematic risk factors from the density fλ x , and computing the associated first-stage weights of Equation (26). Recall that we can interpret these realisations as corresponding to different economic scenarios. Intuitively, sampling from fλ x instead of f increases the proportion of adverse scenarios that are generated in the first stage. In the examples we consider, fλ x concentrates most of its mass near the boundary of the region of interest, and the effect is to concentrate the distribution of In the second stage, one first checks whether or not the realised values of the systematic risk factors lie inside the region of interest. If they do then the event of interest is no longer rare and there is no need to apply further IS in the second stage. Otherwise, if we "miss" the region of interest in the first stage, we "correct" this mistake by applying an exponential tilt to the conditional distribution of individual losses. Specifically, we transfer mass from the left tail of g to the right tail, in order to produce a density whose mean is exactly x.
Algorithm 3 Proposed IS Algorithm for Estimating p x 1: Computeλ using a small pilot simulation. 2: Simulate M i.i.d. copies of the systematic risk factors from fλ(z) and compute the corresponding first-stage IS weights. Denote the realised values of the factors by z 1 , . . . , z M and the associated IS weights by Λ 1 (z 1 ), . . . , Λ 1 (z M ). 3: For each scenario m, determine whether or not z m lies in the region of interest (i.e., whether or not µ(z m ) ≥ x). If it does lie in the region, proceed as follows: (a) Simulate the idiosyncratic risk factors for each exposure. Denote the simulated values by y 1,m , . . . , y N,m .
Simulate the exposure's idiosyncratic risk factor (denote the realised value byŷ i,m ) and setˆ i,m = L(z m , y i,m ).
(ii) Simulate a random number drawn uniformly from the unit interval (denote the realised value by u) and determine whether or not and proceed to the next exposure. Otherwise, return to step (i).

Practical Considerations
In this section we discuss some of the practical issues that arise when implementing the proposed methodology.

One-and Two-Stage Estimators
The rejection sampling procedure employed in the second stage of the proposed algorithm involves repeated evaluation ofθ, which requires a non-trivial amount of computational time time. In addition, rejection sampling in general requires relatively complicated code. As such, it is worth considering a simpler algorithm that only applies importance sampling in the first stage, and is therefore easier to implement and faster to run.
In what follows we will distinguish between one-and two-stage IS algorithms. A one-stage algorithm only applies IS in the first stage and samples (Z, L) from the IS density: The associated IS weight is Λ 1 (z) and the one-stage algorithm is summarised in Algorithm 4 below. Note the simplicity of Algorithm 4, relative to Algorithm 3. The two-stage algorithm applies IS in both the first stage and the second stage, sampling (Z, L) from the IS density: The associated IS weight is Λ 1 (z) · Λ 2 (z,¯ N ), and the two-stage algorithm was summarised previously in Algorithm 3.
Algorithm 4 Proposed One-Stage IS Algorithm for Estimating p x 1: Computeλ x using a small pilot simulation. 2: Simulate M i.i.d. copies of the systematic risk factors from fλ(z) and compute the corresponding first-stage IS weights. Denote the realised values of the factors by z 1 , . . . , z M and the associated IS weights by Λ 1 (z 1 ), . . . , Λ 1 (z M ). 3: For each scenario m: (a) Simulate the idiosyncratic risk factors for each exposure. Denote the simulated values by y 1,m , . . . , y N,m .
Although it is simpler to implement and faster to run, the one-stage algorithm is less accurate than the two-stage algorithm. More precisely, the two-stage estimator never has larger variance than the one-stage estimator. To see this, first let E 1S denote expectation under the one-stage IS density h 1S (z, ) given in Equation (31). Then the variance of the one-stage estimator is: where M denotes sample size. And if we let E 2S denote expectation under the two-stage IS density h 2S (z, ) given in Equation (32) then the variance of the two-stage estimator is: In order to compare variances it suffices to compare the second moments appearing above under the actual density h(z, ), and we let E denote expectation with respect to this density. To this end we note that: In light of Equation (29) we get that: The two-stage estimator will therefore never have larger variance than the the one-stage estimator.

Large First-Stage Weights
In the examples that we consider in this paper, the systematic risk factors are Gaussian. When selecting their IS density, one could either (i) shift their means and leave their variances (and correlations) unchanged or (ii) shift their means and adjust their variances (and correlations). In general the latter approach will lead to a much better approximation to the ideal density f x , but could lead to an IS weight that has infinite variance. By contrast, the former approach will always lead to an IS weight with finite variance, but could lead to a poor approximation of the ideal density. At first glance it might seem absurd to consider IS densities whose weights are so unstable as to have infinite variance, but we have found that adjusting the variances of the systematic risk factors can lead to more effective estimators, in terms of both statistical accuracy and run time (see Section 6.1 for more details), provided one stabilises the resulting IS weights in some way. In the remainder of this section we describe a simple stabilisation technique that leads to a computable upper bound on the associated bias (an alternative would be to stabilize unruly IS weights via truncation, as discussed in Ionides (2008)).
Returning now to the general case, suppose that the first-stage IS parameter,λ x , is such that the first-stage IS weight, Λ 1 (Z), has infinite variance. We trim large first-stage weights by fixing a set A ⊂ R d such that Λ 1 (·) is bounded over A, and discarding those simulations for which Z / ∈ A. Specifically, the last line of Algorithm 3 would be altered to return the trimmed estimate: etc. for Algorithm 4. The variance of the so-trimmed estimator is necessarily finite (recall that Λ 2 (z,¯ ) ≤ 1 if¯ > x), and its bias is: where we have used the tower property (conditioning on Z) to obtain the last equality. Using Chernoff's bound in Equation (11) we get that: As it only depends on the small number of systematic risk factors, and not the large number of idiosyncratic risk factors, the right-hand side of Equation (34) is a tractable upper bound on the bias committed by trimming large (first-stage) IS weights. This upper bound can be used to assess whether or not the bias associated with a given set A is acceptable.

Large Rejection Constants
The smaller theĉ, the more efficient is the rejection sampling algorithm employed in the second stage. Indeed the average number of proposals that must be generated in order to obtain one realisation fromĝ x is 1/ĉ. In the examples we consider in this paper,ĉ is (essentially) a decreasing function µ(z), such thatĉ → 1 as µ(z) → x andĉ → ∞ as µ(z) → 0 (see Figure 1). The second-stage rejection algorithm is therefore quite efficient when µ(z) ≈ x and quite inefficient when µ(z) ≈ 0. Now, the IS density for the first-stage risk factors is such that the distribution of µ(Z) concentrates most of its mass near x (whereĉ is a reasonable size), but it is still theoretically possible to obtain a realisation of the systematic risk factors for which µ(z) is very small andĉ is unacceptably large (e.g., 10 4 ). In such situations the algorithm effectively grinds to a halt, as one endlessly generates proposed losses that have no realistic chance of being accepted. It is extremely unlikely that one does obtain such a scenario under the first-stage IS distribution, but it is still important to protect oneself against this unlikely event. To this end we suggest fixing some maximum acceptable rejection constant c max , and only applying the second stage IS to those first-stage realizations for which µ(z) < x andĉ ≤ c max . In other words, even if the realised values of the systematic risk factors lie outside the region of interest, we avoid applying the second stage if the associated rejection constant exceeds the predefined threshold.

Computingθ
Repeated evaluations ofθ(x, ·) are necessary when computingλ x at the outset of the algorithm, as well as during the second stage of the two-stage algorithm. Recall that in order to computeθ(x, z) "exactly" one must numerically solve the equation k (θ, z) = x, which requires a non-trivial amount of CPU time. As each evaluation ofθ is relatively costly, repeated evaluation would, in the absence of any further approximation (over and above that inherent in numerical root-finding), account for the vast majority of the algorithm's total run time.
In order to reduce the amount of time spent evaluatingθ we fit a low degree polynomial to the functionθ(x, ·) that can be evaluated extremely quickly, considerably reducing total run time. Specifically, suppose that we must computeθ(x, z n ) for each of n points z 1 , . . . , z n (either the sample points from the pilot simulation, or the first-stage realisations that did not land in the region of interest). We identify a small set C ⊂ R d that contains each of the n points, construct a mesh of m << n points in C, evaluateθ exactly at each mesh point, and then fit a fifth degree polynomial to the resulting data. Lettingθ(x, ·) denote the resulting polynomial, we then evaluateθ(x, z 1 ), . . . ,θ(x, z n ) instead ofθ(x, z 1 ), . . . ,θ(x, z n ). If m is substantially smaller than n, then the reduction in CPU time is considerable.

PD-LGD Correlation Framework
All of the PD-LGD correlation models listed in the introduction are special cases of the following general framework-an observation that, to the best of our knowledge, has not been made in the literature. The systematic risk factors take the form Z = (Z D , Z L ), where Z D and Z L are bivariate normal with standard normal margins and correlation ρ S . Idiosyncratic risk factors take the form Associated with each exposure is a default driver X i,D and a loss driver X i,L , defined as follows: The factor loadings α D and α L are constants taking values in the unit interval, and dictate the relative importance of systematic risk versus idiosyncratic risk. The correlation between default drivers of distinct exposures is ρ D := α 2 D and the correlation between loss drivers of distinct exposures is ρ L := α 2 L . The correlation between the default and potential loss drivers of a particular exposure is: which can be positive or negative (or zero). Note that if ρ S and ρ I have the same sign then, since both factor loadings are positive, ρ DL inherits this common sign. The realised loss on exposure i is L i = D i · L i , where: is the default indicator associated with exposure i and is called the potential loss (our terminology) associated with exposure i. Here P denotes the common default probability of all exposures and h is some function from R to [0, max ]. It is useful (but not necessary) to think of potential loss as L i = max(0, 1 − C i ), where C i is the value of the collateral pledged to exposure i expressed as a fraction of the loan's notional value.
Models in this framework are characterised by (i) the correlation structure of the risk factors, specifically restrictions on the values of ρ I and ρ S , and (ii) the marginal distribution of potential loss. For instance:
Note that if ρ S = ±1 then the systematic risk factor is effectively one-dimensional. Indeed if ρ S = 1 then Z = (Z, Z) from some standard Gaussian variable Z, and if ρ S = −1 then Z = (Z, −Z). We refer to the case |ρ S | = 1 as the one-factor case, and the case |ρ S | < 1 as the two-factor case. In the one-factor case we use Z, and not Z, to denote the systematic risk factor. The first two models listed above are one-factor models, the last two are two-factor models.
The marginal distribution of potential loss is determined by the specification of the function h. For instance:

•
Frye (2000) specifies h(x) = max(0, 1 − a(1 + bx)) for constants a ∈ R and b > 0. Potential loss takes values in [0, ∞). Its density has a point mass at zero and is proportional to a Gaussian density on (0, ∞). Since L i is not constrained to lie in the unit interval, this specification violates the assumptions made in Section 2.3; • Pykhtin (2003) specifies h(x) = max(0, 1 − e a+bx ) for constants a ∈ R and b > 0. Potential loss takes values in [0, 1). Its density has a point mass at zero, and is proportional to a shifted lognormal density over (0, 1); • Witzany (2011) and Miu and Ozdemir (2006) both specify h(x) = B −1 a,b (Φ(x)), where a, b > 0 and B a,b denotes the cdf of the beta distribution with parameters a and b. Potential loss takes values in (0, 1). It is a continuous variable and follows a beta distribution.
The sign of ρ DL and the nature of the function h (increasing or decreasing) will in general determine the sign of the relationship between D i and L i . If ρ DL > 0 then the relationship will be positive [negative] provided h is decreasing [increasing], and vice versa if ρ DL < 0.

Computing µ(z)
Here vectors z ∈ R 2 take the form z = (z D , z L ) T . In order to obtain an expression for µ(z) = E[L i |Z = z], we begin with the observation that: Thus, where are the conditional mean and variance of X i,D , respectively, given that (X i,L , Z) = (x L , z). In general µ(z) must be evaluated using quadrature, and doing so is straightforward 6 . On average (across parameter values and points z ∈ R 2 ) a single evaluation of µ(·) requires approximately one millisecond.

Computing k(θ, z) andθ(x, z)
Here again, vectors z ∈ R 2 take the form z = (z D , z L ) T . In order to derive an expression for k(θ, z) we begin with the observation that: and since k(θ, z) = log(E[e θL i |Z = z]), we get that: where m(x L , z) and v are given in the previous section. In the one-factor case with into Equation (38). As with µ(z), k(θ, z) must in general be evaluated using quadrature, which is straightforward. The time required for a single evaluation of k(θ, ·) is comparable to that required for a single evaluation of µ(·).
In order to computeθ we must solve the equation k (θ, z) = x with respect to θ. Differentiating Equation (38) we get: which is straightforward to compute using quadrature. A single evaluation of k (θ, z) requires approximately twice as much time as a single evaluation of k(θ, z). As the root of k (θ, z) = x must be evaluated numerically, evaluatingθ is much more time consuming than evaluating k or k .
Across parameter values and points z ∈ R 2 , and using θ = 0 as an initial guess, the average time required for a single evaluation 7 ofθ(x, ·) is slightly less than one tenth of one second. The right panel of Figure 1 illustrates the relationship between expected losses and the rejection constant employed in the second stage,ĉ = exp(θ − k(θ, z)). We see thatĉ is essentially a decreasing function of µ(z), such thatĉ → 1 as µ(z) → x andĉ → ∞ as µ(z) → 0. The left panel of Figure 1 illustrates the graph of the LDA approximation P(L N > x|Z = z) ≈ exp(− Nq(x, z)). The approximation is identically equal to one inside the region of interest, and decays to zero very rapidly outside the region. In other words, most of the variability in the function q(x, ·) occurs along, and just outside, the boundary of the region of interest. 6 All calculations are carried out using Matlab 2018a on a 2015 MacBook Pro with 6.8 GHz Intel Core i7 processor and 16 GB (1600 MHz) of memory. Numerical integration is performed using the built-in integral function. 7 We use the Matlab function fzero for the root-finding.

Exploring the Parameter Space
The model contains five parameters, in addition to any parameters associated with the transformation h. We are ultimately interested in how well the proposed algorithms perform across a wide range of different parameter sets. As such, in our numerical experiments we will randomly select a large number of parameter sets according to the procedure described below, and assess the algorithms' performance for each parameter set.

•
Generate the default probability P uniformly between 0% and 10%, and generate each of the correlations ρ D = α 2 D and ρ L = α 2 L uniformly between 0% and 50%; • In the one-factor model, generate ρ S uniformly on {−1, 1}, i.e., ρ S takes on the value −1 or +1 with equal probability. If ρ S = 1 we generate ρ I uniformly between 0% and 100%, and if ρ S = −1 we generate ρ I uniformly between −100% and 0%. This allows us to control the sign of ρ DL , which we must do in order to ensure a positive relationship between default and potential loss. In the two-factor model we randomly generated ρ S uniformly on [−1, 1]. If ρ S is positive, randomly generate ρ I uniformly on [0, 1], otherwise randomly generate ρ I uniformly on [−1, 0]; • We choose the transformation h(·) to ensure that (i) potential loss is beta distributed and (ii) there is a positive relationship between default and loss. The paramters a and b of the beta distribution are generated independently from an exponential distribution with unit mean. If is the cumulative distribution function for the beta distribution with parameters a and b. Note that under these restrictions, in the one-factor model the expected loss function µ(z) is monotone decreasing.
In order to ensure that we are considering cases of practical interest, we randomise the portfolio size and loss threshold as follows.

•
Generate the number of exposures randomly between 10 and 5000; • In the one-factor model we generate the threshold x by setting x = µ(Φ −1 (10 −q )), where q is uniformly distributed on [1,5]. The LPA suggests that This means that log(p x ), the order of magnitude of the probability of interested, is approximately uniformly distributed on [−5, −1]. In the two-factor model we set x = µ(z q ), where z q = (Φ −1 (q), ρ S Φ −1 (q)) and q is uniformly distributed on [−5, −1].

Implementation
In this section we discuss our implementation of the algorithm proposed in Section 3 in the general framework outlined in Section 5. As the general framework encompasses many of the PD-LGD correlations that have been proposed in the literature, this section effectively discusses implementation of the proposed algorithm across a wide variety of models that are used in practice.

Selecting the IS Density for the Systematic Risk Factors
The systematic risk factors here are Gaussian. When constructing their IS density we could either shift their means and leave their variances (and correlations) unchanged, or shift their means and adjust their variances (and correlations). Recall that the ultimate goal is to choose an IS density that closely resembles the ideal density f x given in Equation (15). As illustrated 8 in Figure 2, the ideal density f x tends to be very tightly concentrated about its mean, and adjusting the variance of the systematic risk factors leads to a much better approximation to the ideal density for "typical values" of the ideal density. The left tail of the ideal density is, however, heavier than the variance-adjusted IS density, an issue that can be resolved by trimming large IS weights.
The downside to adjusting the variance of the systematic risk factors is that it can lead to first-stage IS weights with infinite variance, but numerical evidence suggests that this issue can be mitigated by 8 In the one-factor model, a tractable approximation to the ideal density can be obtained by using the LDA of Equation (13) to approximate both probabilities appearing in Equation (15). The result is: and the right-hand side of Equation (40) can be approximated via quadrature. As the integrand involvesθ, the approximation is computationally very slow.
trimming large weights. Indeed, numerical experiments 9 suggest that adjusting variance and trimming large weights leads to substantially more accurate estimators of p x . Intuitively, it is more important for the IS density to mimic the behaviour of the ideal density over its "typical range", as opposed to faithfully representing its tail behaviour. In addition to improving statistical accuracy, adjusting variance has the added benefit of making the second stage of the algorithm more computationally efficient in terms of run time. Indeed, as discussed in more detail in Section 6.3, adjusting variance tends to increase the proportion of first-stage simulations that land in the region of interest (thereby reducing the number of times the rejection sampling algorithm must be employed in the second stage) and reduces the average size of the rejection constants employed in the second stage (thereby making the rejection algorithm more effective whenever it must be employed).

First Stage
In this section we explain how to efficiently approximate the parameters of the optimal IS density for the systematic risk factors, in both the one-and two-factor models. We also explain how we trim large IS weights, and demonstrate that the resulting bias is negligible.

Computing Parameters in the Two-Factor Model
In the two-factor model the systematic risk factors are bivariate Gaussian with zero mean vector and covariance matrix: The mean vector and covariance matrix that satisfy the criteria of Equation (25) are: 10 and respectively. In order to approximate the suggested mean vector and covariance matrix we use Equation (27) to get: and The expected values appearing on the right-hand sides of Equations (43) and (44)  Whether or not we adjust the variance of the systematic risk factor, the standard error of the resulting estimator is of the form ν/ √ M, where ν depends on the model parameters and is easily estimated via simulation. Using 100 randomly selected parameter sets from the one-factor model, selected according to the procedure described in Section 5.3, we find that for the one-stage estimator ν MS /ν VA ≈ 1.54p −0.03 x , where ν MS denotes the value of ν assuming we only shift the mean of the systematic risk factor and do not adjust its variance and ν VA denotes the value when we do adjust variance. For probabilities in the range of interest, then, adjusting the variance of the systematic risk factor leads to an estimator that is nearly four times as efficient, in the sense that the sample size required to achieve a given degree of accuracy (as measured by standard error) is nearly four times larger if we do not adjust variance. 10 As discussed in Appendix B, the natural sufficient statistic here consists of the components of Z plus the components of ZZ T . As such, in order to satisfy Equation (27)  In order to implement the approximation we must first simulate the systematic risk factors and then compute q(x, z) for each sample point z. The most natural way to proceed is to (i) sample the systematic risk factors from their actual distribution (bivariate Gaussian with zero mean vector and covariance matrix Σ) and (ii) numerically solve the equation k (θ, z) = x in order to compute computê θ(x, z) for each pilot sample point z that lies outside the region of interest. In our experience this leads to unacceptably inefficient estimators, in terms of both (i) statistical accuracy and (ii) computational time. We deal with each issue in turn.
As most of the variation in q(x, ·) occurs just outside the boundary of the region of interest (recall the right panel of Figure 1), we suggest using an IS distribution for the pilot simulation that is centered on the boundary of the region. Specifically, we suggest using that point on the boundary at which the density of the systematic risk factors attains its maximum value (i.e., the most likely point on the boundary): The non-linear minimisation problem appearing above is easily and rapidly solved using standard techniques. We used fmincon function in Matlab.
As z x lies on the boundary of the region of interest, roughly half the pilot sample will lie outside the region. In Section 5.2 we noted that it takes nearly one tenth of one second to numerically solve the equation k (θ, z) = x. As such, if we are to computeθ exactly (i.e., by numerically solving the indicated equation) for each sample point that lies outside the region of interest, the total time required (in seconds) to estimate the first-stage IS parameters will be at least M p /20. In our numerical examples we use a pilot sample size of M p = 1000, which means that it would take nearly one full minute to compute the first-stage IS parameters. This discussion suggests that reducing the number of times we must numerically solve the equation k (θ, z) = x could lead to a dramatic reduction in computational time.
We suggest fitting a low degree polynomial to the functionθ(x, ·), over a small region in R 2 that contains all of the pilot sample points that lie outside the region of interest. Specifically, we determine the smallest rectangle that contains all of the pilot sample points, and discretize the rectangle using a mesh of n 2 g points, equally spaced in each direction. Next, we identify those mesh points that lie outside the region of interest and computeθ(x, z) exactly (i.e., by solving k (θ, z) = x numerically) for each such point. Finally, we fit a polynomial to the resulting (z,θ(x, z)) pairs and call the resulting functionθ(x, ·). Numerical evidence indicates at using a fifth-degree polynomial and a mesh with 15 2 = 225 points leads to a sufficiently accurate approximation toθ(x, ·) over the indicated range (the intersection of (i) the smallest rectangle that contains all sample points and (ii) the complement of the region of interest). Note thatθ could be an extremely inaccurate approximation toθ outside this range, but that is not a concern because we will never need to evaluate it there.
It remains to compute q(x, z) for each of the pilot points z. For those points z that lie inside the region of interest, we set q(x, z) = 0. For those points that lie inside the region, we set q(x, z) = θx − k(θ, z), whereθ =θ(x, z). Evaluatingθ(x, ·) requires essentially no computational time (it is a polynomial), and if the mesh size and degree are chosen appropriately the difference betweenθ andθ is very small. In total, the suggested procedure reduces the number of evaluations ofθ from M p /2 to n g /2, for a percentage reduction of n 2 g /M p . In our numerical examples we use n g = 15 and M p = 1000, which corresponds to a reduction of 75% in computational time.
To summarise, we estimate the optimal first-stage IS parameters as follows. First, we compute z x . Second, we draw a random sample of size M p from the Gaussian distribution with mean vector z x and covariance matrix Σ. Third, we constructθ(x, ·), the polynomial approximation toθ(x, ·), as described in the previous paragraph. Fourth, for those sample points z that lie outside the region of interest we compute q(x, z) usingθ insteadθ. The estimates of the optimal first-stage IS parameters are then: where Z 1 , . . . , Z M p is the random sample and is the IS weight associated with shifting the mean of the systematic risk factors from 0 to z x . The upper left panel of Figure 3 illustrates a typical situation where the mean of the IS distribution lies "just inside" the region of interest.

Computing Parameters in the One-Factor Model
The procedure described in the previous section specialises in the one-factor case as follows. First, under the parameter restrictions outlined in Section 5.3, the expected loss function µ(z) is a strictly decreasing function of z. As such, the region of interest is the semi-infinite interval (−∞, z x ), where z x := µ −1 (x), and its boundary is the single point z x . In general z x must be computed numerically, which is straightforward. Second, we draw a random sample of size M p from the Gaussian distribution with mean z x and unit variance. Third, the polynomial approximation toθ is constructed by evaluatinĝ θ exactly (i.e., by numerically solving the equation k (θ, z) = x) at each of n g equally-spaced points z in the interval [z − , z + ], where z − and z + are the largest and smallest values obtained in the pilot simulation, respectively, and then fitting a polynomial to the resulting (z,θ(x, z)) pairs. Fourth, we evaluate q(x, z) for each pilot sample point z as follows-if z lies inside the region of interest we set q(x, z) = 0, otherwise we compute q(x, z) by replacing the exact valueθ(x, z) with the approximate valueθ(x, z), whereθ is the polynomial constructed in the previous step. Note that a single evaluation ofθ requires far less computational time than a single evaluation ofθ. Finally, the approximations to the first-stage IS parameters are: where Z 1 , . . . , Z M p is the random sample and is the IS weight associated with shifting the mean of the systematic risk factor from 0 to z x .

Trimming Large Weights
In the one-factor model the first-stage IS weight will have infinite variance whenever σ 2 IS < 0.5 (see Remark A1 in Appendix B). In a sample of 100 parameter sets, randomly selected according to the procedure in Section 5.3, the largest realised value of σ 2 IS was 0.38, and the mean and median were 0.11 and 0.09, respectively. It appears, then, that the first-stage IS weight in the one-factor model will have infinite variance in all cases of practical interest. We trim large weights as described in Section 4.2, using the set: for some constant C. In the numerical examples that follow we use C = 4, in which case we expect to trim less than 0.01% of the entire sample. Specialising Equation (34) to the present context, we get that an upper bound on the associated bias is given by: which is straightforward (albeit slow) to compute using quadrature. Figure 4 illustrates the relationship between the probability of interest p x and the upper bound of Equation (46) for the 100 randomly generated parameter sets, and clearly demonstrates that the bias associated with our trimming procedure is negligible. For instance, for probabilities on the order of 10 −3 the bias is no larger than 10 −5 , or 1% of the quantity of interest.
In the two-factor model the first-stage IS weight will have infinite variance whenever det(2Σ −1 − Σ −1 IS ) < 0. In a random sample of 100 parameter sets, this condition occurred 96 times. As in the one-factor model, then, the first-stage IS weight in the two-factor model can be expected to have infinite variance in most cases of practical interest. We trim large weights using the set: for some constant C, and use C = 4 in the numerical examples that follow. , for 100 randomly generated parameter sets in the one-factor case. For each set, we compute bias (in fact, an upper bound on the bias) by using quadrature to approximate Equation (46) and estimate the probability of interest using the full two-stage algorithm.

Second Stage
The first stage of the algorithm consists of (i) computing the first-stage IS parameters, (ii) simulating a random sample of size M from the systematic risk factors' IS distribution, and (iii) computing the associated IS weights, trimming large weights appropriately. Having completed these tasks, the next step is to simulate individual losses in the second stage. In the remainder of this section we let z = (z D , z L ) denote a generic realisation of the systematic risk factors obtained in the first stage.

Approximatingθ
Before generating any individual losses first construct the polynomial approximation toθ, using the same procedure described in Section 6.2.1. The basic idea is to fit a relatively low degree polynomial to the surface ofθ(x, ·), over a small region that contains all of the first-stage sample points. The values of z obtained in the pilot sample are invariably different from those obtained in the first stage, so it is essential that the polynomial is refit to account for this fact. In what follows we useθ to approximateθ whenever the numerical value ofθ is required, but since the difference between the two is small we do not distinguish between the two (i.e., we writeθ in this document, but useθ in our code).

Sampling Individual Losses
In this section we describe how to sample individual losses in the two-factor model. The procedure carries over in an obvious way to the one-factor model, so we do not discuss that case explicitly.
If z lies inside the region of interest then the second stage is straightforward. For a given exposure i, we first simulate the exposure's idiosyncratic risk factors Y i = (Y i,D , Y i,L ), from the bivariate normal distribution with standard normal margins and correlation ρ I . Next, we set: If X i,D > Φ −1 (P) then the exposure did not default and we set L i = 0 and proceed to the next exposure. Otherwise the exposure did default, in which case we must compute h(X i,L ), set i = h(x i,L ) and then proceed to the next exposure. Note that we only evaluate h for defaulted exposures-this is important since evaluating h requires numerical inversion of the beta cdf, which is relatively slow.
Having computed the individual losses associated with each exposure, we then compute the average loss¯ = N −1 ∑ N i=1 i and set Λ 2 (z,¯ ) = 1. If z lies outside the region of interest we must computeθ, k(θ) andĉ, which we do approximately using the polynomial approximationθ. We then sample fromĝ x (·|z) as follows. First simulate the idiosyncratic risk factors Y i = (Y i,D , Y i,L ) from the bivariate normal distribution with standard normal margins and correlation ρ I . Also generate a random number U, independent of Y i . Then set: If the exposure did not default we setL i = 0, otherwise we compute h and setL i = h(X i,L ). Next we check whether or not then acceptL i as a drawing fromĝ x , that is, set L i =L i and proceed to exposure i. Otherwise, draw another random number U and set of idiosyncratic factors. Once we have sampled the individual losses associated with each exposure we compute the average loss¯ = N −1 ∑ N i=1 i and set Λ 2 (z,¯ ) = exp(−N[θ¯ − k(θ, z)]), using the polynomial approximation to estimate the value ofθ.

Efficiency of the Second Stage
The frequency with which the rejection sampling algorithm must be applied in the second stage is governed by P IS (µ(Z) > x). The left panel of Figure 5 illustrates the empirical distribution of this probability across 100 randomly selected parameter sets. The distribution is concentrated towards small values (the median fraction is 27%) but does have a relatively thick right tail (the mean fraction is 35%). In some cases-particularly when the value of the parameter ρ D is close to zero, in which case individual losses are very nearly independent and systematic risk is largely irrelevant-the vast majority of first-stage simulations require further IS in the second stage.
The efficiency of the rejection sampling algorithm, when it must be applied, is governed by the conditional distribution ofĉ =ĉ(x, Z) given that µ(Z) < x. For each of the 100 parameter sets we estimate E IS [ĉ(x, Z)|µ(Z) < x], which determines the average size of the rejection constant for a given set of parameters, by computing the associated value ofĉ for each first-stage realisation that lies outside the region of interest and then averaging the resulting values. The right panel of Figure 5 illustrates the results, and we note that the mean and median of the data presented there are 1.09 and 1.17, respectively. The figure clearly indicates that the rejection sampling algorithm can be expected to be quite efficient, whenever it must be applied.
The distributions of P IS (µ(Z) < x) and E IS [ĉ(x, Z)|µ(Z) < x] across parameters depend heavily on whether or not we adjust the variance of the systematic risk factors in the first stage. When we do not adjust variance, the mean and median of P IS (µ(Z) < x) (across 100 randomly selected parameter sets) rise to 49% and 45% (as compared to 35% and 27% when we do adjust variance), and the mean and median of E IS [ĉ(x, Z)|µ(Z) < x] rise to 18.6 and 1.8, respectively (as compared to 1.17 and 1.09 when we do adjust variance).
Remark 7. If we do not adjust the variance of the systematic risk factors in the first stage, then (i) the rejection sampling algorithm must be applied more frequently and (ii) is less efficient whenever it must be applied. As such, adjusting the variance of the systematic risk factors reduces the total time required to implement the two-stage algorithm. Recall that the former quantity determines the frequency with which the second-stage rejection sampling algorithm must be applied and the latter quantity determines the efficiency of the algorithm when it must be applied. For each of 100 parameter sets, randomly selected according to the procedure described in Section 5.3, we compute the first-stage IS parameters and then draw 10,000 realisations of the systematic risk factors from the variance adjusted first-stage IS density.
The intuition behind this fact is as follows. First recall that the mean of the systematic risk factors tends to lie just inside the region of interest (recall Figure 3). In such cases the effect of reducing the variance of the systematic risk factors is to concentrate the distribution of Z just inside the boundary of the region of interest. Not only will this ensure that more first-stage realisations lie inside the region of interest (thereby reducing the fraction of points that require further IS in the second stage), it will also ensure that those realisations that lie outside the region (i.e., for which µ(z) < x) do not lie "that far" outside the region (i.e., that µ(z) is not "that much less" than x), which in turn ensures that the typical size ofĉ is relatively close to one (recall the left panel of Figure 1).

Performance Evaluation
In this section we investigate the proposed algorithms' performance in terms of statistical accuracy, computational time, and overall. Unless otherwise mentioned, we use a pilot sample size of M p = 1000 to estimate the first-stage IS parameters and a sample size of M = 10,000 to estimate the probability of interest (p x ). We use the value C = 4 to trim large first-stage IS weights, and a value of c max = 10 to trim large rejection constants.

Statistical Accuracy
The standard error of any estimator that we consider is of the form ν x / √ M for some constant ν x that depends on the algorithm used and the model parameters. For instance, for the one-stage estimator in the two-factor case we have ν x = SD 1S (Λ 1 (Z) · 1 {L N >x} ), where SD 1S denotes standard deviation under the one-stage IS density of Equation (31). Note that in the absence of IS we have Figure 6 illustrates the relationship between ν x and p x using 100 randomly selected parameters sets, for the two-stage algorithm and in the two-factor case. Importantly, we see that (i) ν x seems to be a function of p x (i.e., it only depends on model parameters through p x ) and (ii) for small probabilities the functional relationship appears to be of the form ν x = ap b x for constants a and b. These features are also present in the case of the one-stage estimator, as well as for both estimators in the one-factor model. The numerical values of a and b are easily estimated using the line of best fit (on the logarithmic scale), and the estimated values for both the one-and two-factor cases are summarised in Table 1. Of particular note is the fact that the value of b is extremely close to one in every case.  (32), in the two-factor case.
The numerical values of p x and ν x are estimated for each of 100 randomly generated parameters sets, according to the procedure described in Section 5.3. Table 1. This table reports fitted values of the relationship ν x ≈ ap b x for each estimator (one-and two-stage) and each model (one-and two-factor). Values of a and b are obtained by determining the line of best fit on the logarithmic scale (i.e., the line appearing in Figure 6). Note that in the absence of IS we would have ν x .

One-Stage Algorithm Two-Stage Algorithm
One-Factor Model 0.91p 0.98 x 0.81p 0.99 x Two-Factor Model 0.98p 0.98 x 0.81p 0.98 x Of particular interest in the rare event context is an estimator's relative error, defined as the ratio of its standard error to the true value of the quantity being estimated. For any of the estimators that we consider, the component of relative error that does not depend on sample size is ν x /p x ≈ ap b−1 x . In the absence of IS we have b − 1 = −0.5, in which case relative error grows rapidly as p x → 0 (i.e., ν x → 0 but ν x /p x → ∞ as p x → 0). By contrast, b ≈ 1 for any of our IS estimators, in which case there is weak dependence of relative error on p x . The minimum sample size required to ensure that an estimator's relative error does not exceed the threshold is v 2 In the absence of IS we have b ≈ 0.5, in which case the sample size (and therefore computational burden) required to achieve a given degree of accuracy increases rapidly as p x → 0. By contrast, for all of our IS estimators we have b ≈ 1, in which case the minimum sample size (and computational burden) is nearly independent of p x . Our ultimate goal is to reduce the computational burden associated with estimating p x , in situations where p x is small. To see how effective the proposed algorithms are in this regard, note that the sample size required to achieve a given degree of accuracy using the proposed algorithm, relative to that required to achieve the same degree of accuracy in the absence of IS, is approximately which does not depend on . Since a < 1 and b > 0.5 (recall Table 1), we have that a 2 p 2b−1 x < p x .
Remark 8. The relative sample size required to achieve a given degree of accuracy using the proposed algorithm, relative to that required in the absence of IS, is not larger than the probability of interest. For example, if the probability of interest is approximately 1%, then the proposed algorithm requires a sample size that is less than 1% of what would be required in the absence of IS (regardless of the desired degree of accuracy). And if the probability of interest is 0.1%, then the proposed algorithm requires a sample size that is less than 0.1% of what would be required in the absence of IS. In other words, the proposed algorithm is extremely effective at reducing the sample size required to achieve a given degree of accuracy.
It is also insightful to compare the efficiency of the two-stage estimator, relative to the one-stage estimator. In the one-factor case, the minimum sample size required using the two-stage algorithm, relative to that required using the one-stage algorithm, is approximately: As p x ranges from 1% to 0.01% the estimated relative sample size ranges from 0.73 to 0.67. In the two-factor case, the relative sample size is approximately 0.69, regardless of the value of p x .

Remark 9.
In both the one-and two-factor models, the two-stage algorithm is more efficient than the one-stage algorithm, in the sense that it requires a smaller sample size in order to achieve a given degree of accuracy. Indeed, in cases of practical interest (probabilities in the range of 1% to 0.01%) the minimum sample size required to achieve a given degree of accuracy using the two-stage algorithm is roughly 70% of what would be required using the one-stage algorithm. Figure 7 illustrates the relationship between sample size (M) and run time (total time required to estimate p x using a particular algorithm), for one randomly selected set of parameters. Across both models and algorithms, the relationship is almost perfectly linear. In the absence of IS the intercept is zero (i.e., run time is directly proportional to sample size), whereas the intercepts are non-zero for the IS algorithms. The non-zero intercepts are due to the overhead associated with (i) computing the first-stage IS parameters, which accounts for almost all of the difference between the intercepts of the solid (no IS) and dashed (one-stage IS) intercepts, and (ii) computing the second-stage polynomial approximation toθ, which accounts for almost all of the difference between the intercepts of the the dashed (one-stage IS) and dash-dot (two-stage IS) lines. It is also worth noting that a given increase in sample size will have a greater impact on the run times for the IS algorithms than it will on the standard algorithm. This is because we only calculate h(X i,L ) for defaulted exposures (evaluating h(·) is slow because it requires numerical inversion of the beta distribution function), and the default rate is higher under the IS distribution.

Computational Time
Across 100 randomly generated parameter sets, portfolio size (N) is most highly correlated with run time and the relationship is roughly linear. Table 2 reports summary statistics on run times, across algorithms and models. Table 2. This table reports summary statistics-in seconds, and across 100 randomly selected parameter sets-for total run time (first three columns), time required to estimate the first-stage IS parameters (fourth column) and time required to fit the second-stage polynomial approximation toθ (final column).

Overall Performance
Recall that the ultimate goal of this paper is to reduce the computational burden associated with estimating p x , when p x is small. The computational burden associated with a particular algorithm is a function of both its statistical accuracy and total run time. We have seen that the proposed algorithms are substantially more accurate, but require considerably more run time. In this section we demonstrate that the benefit of increased accuracy is well worth the cost of additional run time, by considering the amount of time required by a particular algorithm in order to achieve a given degree of accuracy (as measured by relative error).
To begin, let t(M) denote the total run time required by a particular algorithm to estimate p x using a sample of size M. As illustrated in Figure 7 we have t(M) ≈ c + dM for constants c and d that depend on the underlying model parameters (particularly portfolio size, N) as well as the algorithm being used. In Section 7.1 we saw that the minimum sample size required to ensure that the estimator's relative error does not exceed the threshold isL for constants a and b depending on the underlying model (one-or two-factor) and algorithm being used. Thus, if T( ) denotes the total CPU time required to ensure that the estimator's relative error does not exceed , we have:  Figure 7 to estimate c and d and the values of a and b implicitly reported in Table 1.
The results reported in the table are representative of those obtained using different parameter sets. It is clear that the proposed algorithms can substantially reduce the computational burden associated with accurate estimation of small probabilities. For instance, if the probability of interest is on the order of 0.1% then either of the proposed algorithms can achieve 5% accuracy within 2-3 s, as compared to 4 min (80 times longer) in the absence of IS.  (48)) for several values of p x and , for the parameter values corresponding to the left panel of Figure 7. Note that this is for the one-factor model. Values of c and d are obtained from the lines of best fit appearing in the left panel of Figure 7, values of a and b are obtained from The two-stage estimator is statistically more accurate (Section 7.1) but computationally more expensive (Section 7.2) than the one-stage estimator. It is important to determine whether or not the benefit of increased accuracy outweighs the cost of increased computational time. Table 3 suggests that, in some cases at least, implementing the second stage is indeed worth the effort, in the sense that it can achieve the same degree of accuracy in less time. Figure 8 illustrates the overall efficiency of the proposed algorithms, as a function of the desired degree of accuracy. Specifically, the left panel illustrates the ratio of (i) the total CPU time required to ensure the standard estimator's relative error does not exceed a given threshold to (ii) the total time required by the proposed algorithms, for a randomly selected set of parameter values in the one-factor model. The right panel illustrates the same ratio for a randomly selected set of parameters in the two-factor model. In the one-factor model, it would take hundreds of times longer to obtain an estimate of p x whose relative error is less than 10%, and thousands of times longer to obtain an estimate whose relative error is less than 1%. The figure also suggests that, since it requires less run time to obtain very accurate estimates, the two-stage algorithm is preferable to the one-stage algorithm in the one-factor model. In the two-factor model-where estimating IS parameters and fitting the second-stage polynomial approximation toθ is more time consuming-the proposed algorithms are hundreds of times more efficient than the standard algorithm. In addition, it appears that the one-stage algorithm is preferable to the two-stage algorithm in this case. Although the numerical values discussed here are specific to the parameter set used to produce the figure, they are representative of other parameter sets. In other words, the behaviour illustrated in Figure 8 is representative of the general framework overall.

Concluding Remarks
This paper developed an importance sampling (IS) algorithm for estimating large deviation probabilities for the loss on a portfolio of loans. In contrast to existing literature, we allowed loss given default to be stochastic and correlate with the default rate. The proposed algorithm proceeded in two stages. In the first stage one generates systematic risk factors from an IS distribution that is designed to increase the rate at which adverse macroeconomic scenarios are generated. In the second stage one checks whether or not the simulated macro environment is sufficiently adverse-if it is then no further IS is applied and idiosyncratic risk factors are drawn from their actual (conditional) probability distribution, if it is not then one indirectly applies IS to the conditional distribution of the idiosyncratic risk factors. Numerical evidence indicated that the proposed algorithm could be thousands of times more efficient than algorithms that did not employ any variance reduction techniques, across a wide variety of PD-LGD correlation models that are used in practice.
Author Contributions: Both authors contributed equally to all parts of this paper. Both authors have read and agreed to the published version of the manuscript. paragraph, we see thatθ(t) = q(t) = 0 whenever µ ≥ t, whereas bothθ(t) and q(t) are strictly positive whenever µ < t.
The derivative of the transform q is demonstrably equal to: Sinceθ = 0 whenever t ≤ µ and k (θ) = t whenever t > µ, the second term above vanishes for all t, and we find that: q (t) =θ(t) . (A2)

Appendix A.3. Exponential Tilts
For θ ∈ R we define: The density f θ is called an exponential tilt of f . As the value of the tilt parameter θ varies, we obtain an exponential family of densities (exponential families have lots of very useful properties, and this is an easy way of constructing them). If θ is positive then the right and left tails of f θ are heavier and thinner, respectively, than those of f . The opposite is true if θ is negative. The larger in magnitude is θ, the greater the discrepancy between f and f θ ; indeed the Kullback-Leibler divergence from f θ to f is −θµ + k(θ), which is a strictly convex function of θ that attains its minimum value (of zero) at θ = 0. It is readily verified that k (θ) = E θ [X i ], where E θ denotes expectation with respect to f θ . This observation, in combination with the developments in Section A.1, implies that it is always possible to find a density of the form (A3) whose mean is t, whatever the t ∈ (x min , x max ). Indeed fθ is precisely such a density. Under mild conditions, fθ(·) can be characterised as that density that most resembles f (in the sense of minimum divergence), among all densities whose mean is x (and are absolutely continuous with respect to f ).
Recall thatθ is the unique value of θ for which k (θ) = max(t, µ). We can therefore interpret fθ as that density that most resembles f , among all densities whose mean is at least t (and that are absolutely continuous with respect to f ). Note in particular hat the mean of fθ is max(µ, t). The numerical value ofθ can therefore be interpreted as the degree to which we must deform the density f , in order to produce a density whose mean is at least t. If µ ≥ t thenθ = 0 and no adjustment is necessary. If µ < t thenθ > 0 and mass must be transferred from the left tail to the right; the larger the discrepancy between µ (the mean of f ) and t (the desired mean), the larger isθ.
Appendix A.4. Behaviour of X i , Conditioned on a Large Deviation Let f t (x) denote the conditional density of X i , given that X N > t, where X N = 1 N ∑ N i=1 X i . We suppress the dependence of f t on N for simplicity. Using Bayes' rule we get and since the X i are independent, we get P(X N > t|X i = x) = P(X N−1 > t + t−x N−1 ) . Now, using the large deviation approximation P(X N ≥ t) ≈ exp(−N · q(t)), we get that P(X N > t|X i = x) P(X N > t) ≈ exp(−(N − 1)q(t + t−x N−1 ) + Nq(t)) .

Now if N is large then
where we have used the fact that q (t) =θ(t). Putting everything together we arrive at the approximation P(X N > t|X i = x) P(X N > t) ≈ exp(θx − k(θ)) , which leads to the approximation We may thus interpret the conditional density f t as that density which most resembles the unconditional density f , but whose mean is at least t.
Appendix A.5. Approximate Behaviour of (X 1 , X 2 , . . . , X N ), Conditioned on a Large Deviation Letf t (x) =f t (x 1 , . . . , x N ) denote the conditional density of (X 1 , . . . , X N ), given that X N > t. Then where p t = P(X N > t) and A N,t is the set of those points x ∈ [x min , x max ] N whose average value exceeds t. We seek a density h(x), supported on [x min , x max ], which minimizes the Kullback-Leibler divergence (KLD) ofĥ In other words, we seek an independent sequence Y 1 , Y 2 , . . . , Y N (whose density isĥ) whose behaviour most resembles (in a certain sense) the behaviour of X 1 , X 2 , . . . , X N , conditioned on the large deviation X N > t.
Now let E g denote expectation with respect to the density g. Then the divergence ofĥ fromf t is Now, the middle term in the above display is the KLD of h from f t . As such it is non-negative, and is equal to zero if and only if h = f t . It follows immediately that the divergence ofĥ fromf t is minimised by setting h = f t .

Appendix B. Important Exponential Families
This appendix considers two important special cases-the Gaussian and t families-of the general setting discussed in Section 2.2.

Remark A1.
Suppose that f and f λ are Gaussian densities with respective positive definite covariance matrices Σ 0 and Σ. Further suppose that Z ∼ f λ . Then the variance of f (Z)/ f λ (Z) is finite if and only if det(2Σ −1 0 − Σ −1 ) > 0.
In the one-dimensional case d = 1 we write Z = Z. The condition in Remark A1 is satisfied whenever σ 2 > σ 2 0 /2. In other words, if the variance of the IS distribution is too small, relative to actual variance of Z, then the IS weight will have infinite variance.
In the case that Z is multivariate t, then, we can take our systematic risk factors to be the components of (Ẑ, R). In this case the joint density of the systematic risk factors can be embedded into the parametric family f λ,η (ẑ, r) := exp(λ T S(ẑ) − K(λ)) · exp(η T T(r) − L(η)) · f (ẑ) · g(r) , where λ is and S are the natural parameter and sufficient statistic for the Gaussian family, η and L are those for the chi-square family, and f and g are the Gaussian and chi-square densities.