1. Introduction
Model stability concerns the extent to which the predictive accuracy of a model deteriorates after its development, assessed by examining changes in the distribution of the explanatory variables. This is important because models used in credit risk assessment, such as probability of default (PD) models, are generally developed using one set of data (development data) but implemented using different data collected subsequently. For example, under the Basel Accord (Basel Committee on Banking Supervision 2006) for capital and the International Financial Reporting Standards (IFRS 9) for provisioning (International Accounting Standards Board 2014), models are developed to predict future potential losses and are implemented until they are considered no longer "fit-for-purpose". Model stability is an important feature of fit-for-purpose model reviews because stability does not require the observation of the response variable and is therefore available immediately. Other statistics, such as the calibration of the PD to actual default rates, require the observation of loan defaults. For example, a PD model predicting defaults in the next 12 months will use model inputs that are at least 12 months old. Hence, evaluations of stability are timely and important, not only for banks, insurers, and other companies performing credit or other risk modelling, but also for auditors and regulators. The important question of whether a model remains "fit-for-purpose" when the response variable (e.g., default outcome) is also available is typically not referred to as stability and is beyond the scope of this paper.
Traditionally, model stability has been evaluated using the Population Stability Index (PSI), especially in the context of credit models. However, Taplin and Hunt (2019) showed that the PSI has inferior properties for assessing model stability. In particular, by detecting any shift in the distribution of the explanatory variables, it detects changes that have no impact on the predictive accuracy of a model. Indeed, Taplin and Hunt (2019) provided examples where the PSI indicated low model stability even though the accuracy of the model actually improved in the review data compared to the development data. This situation does not indicate a lack of stability of the model, making the PSI of questionable value when assessing whether a model remains fit-for-purpose.
Taplin and Hunt (2019) introduced the Predictive Accuracy Index (PAI) to assess the degree of deterioration in a model's predictive accuracy more effectively than the PSI (see Section 1.1 for its definition). Kruger et al. (2021) recommended the PAI over the PSI due to its superior properties. They particularly recommended the multivariate PAI (MPAI) over the univariate version that considers only one explanatory variable at a time. Becker and Becker (2021) provided further examples where the PSI and PAI produce different results; however, they only considered the univariate version.
Considering the high interest in the PAI, experience gained from using the PAI since its introduction in 2019, and a lack of publications exploring the properties of the PAI, this paper aims to:
- (a) explore the properties of the (multivariate) PAI;
- (b) reflect on its use in practice;
- (c) make further recommendations on how to investigate the cause of instability when a high value of the PAI indicates a lack of model stability.
These results are important in practice in two ways. First, they provide a set of additional statistics and analyses suitable when a model lacks stability. Second, they suggest how techniques used to develop models may lead to model instability. Thus, this paper also contributes advice on model development that is relevant to model developers, as well as to auditors and regulators providing advice or guidelines. However, this paper does not consider techniques using the response variable to assess whether a model remains fit-for-purpose, and it does not consider the broader question of how important the variables in a model are to its predictions. While relatively straightforward for techniques such as regression, the machine learning literature contains important research on this topic for other modelling techniques (Lundberg and Lee 2017; Giudici and Raffinetti 2021).
1.1. The Prediction Accuracy Index (PAI)
Taplin and Hunt (2019, p. 5) defined the Prediction Accuracy Index (PAI) as "the average variance of the estimated mean response at review divided by the average variance of the estimated mean response at development". This definition is broad and can be applied to any model. Taplin and Hunt (2019) show that, in the important case of multiple regression (or any model, such as logistic regression for probability of default modelling, that uses a linear predictor),

$$\mathrm{PAI} = \frac{\frac{1}{m}\sum_{j=1}^{m} x_{Rj}^{T} V x_{Rj}}{\frac{1}{n}\sum_{i=1}^{n} x_{Di}^{T} V x_{Di}}, \qquad (1)$$

where
$x_{Rj}$ is the vector of explanatory variables for the $j$th observation of the review data ($j = 1$ to $m$);
$x_{Di}$ is the vector of explanatory variables for the $i$th observation of the development data ($i = 1$ to $n$);
$V$ is the variance–covariance matrix of the estimated regression coefficients;
$X$ is the design matrix with columns defined by the explanatory variables at development (i.e., with rows $x_{Di}^{T}$).
The vectors $x_{Rj}$ and $x_{Di}$ are column vectors that typically have a dimension of $p+1$ as they contain not only the $p$ explanatory variables but also a 1 for the intercept.
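To make Equation (1) concrete, the following minimal sketch (our illustration, not code from Taplin and Hunt (2019); the data are fabricated) computes the PAI for a linear model directly from the design matrices:

```python
import numpy as np

def pai_linear(X_dev: np.ndarray, X_rev: np.ndarray) -> float:
    """Equation (1): mean prediction variance at review over that at development.

    V is taken as (X'X)^{-1}; the residual variance factor cancels in the ratio.
    """
    V = np.linalg.inv(X_dev.T @ X_dev)
    num = np.mean(np.einsum("ij,jk,ik->i", X_rev, V, X_rev))
    den = np.mean(np.einsum("ij,jk,ik->i", X_dev, V, X_dev))
    return num / den

# Fabricated illustration: 3 explanatory variables plus a leading column of ones.
rng = np.random.default_rng(0)
E_dev, E_rev = rng.normal(size=(50, 3)), rng.normal(size=(60, 3))
X_dev = np.column_stack([np.ones(len(E_dev)), E_dev])
X_rev = np.column_stack([np.ones(len(E_rev)), E_rev])
print(round(pai_linear(X_dev, X_rev), 3))
```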
When interpreting values of the PAI, Taplin and Hunt (2019, pp. 5–6) recommended that "values less than 1.1 indicate no significant deterioration; values from 1.1 to 1.5 indicate a deterioration requiring further investigation; and values exceeding 1.5 indicate the predictive accuracy of the model has deteriorated significantly". Note that, in Taplin and Hunt (2019), the PAI is referred to as the Population Stability Index in the title; however, we refer to it as the Prediction Accuracy Index as this is how it is referred to throughout Taplin and Hunt (2019) and because it more accurately reflects its purpose of summarising the predictive accuracy of a model.
This paper concentrates on the use of the multivariate PAI (referred to as the MPAI in Taplin and Hunt (2019) but referred to here simply as the PAI) in preference to the univariate PAI (Equation (1) when there is only one explanatory variable). This is because most models contain many explanatory variables and because examining the impact of a change in the distribution of one explanatory variable with the PAI (or PSI) in isolation is inconsistent with the properties of the model (unless the model only contains one explanatory variable). We also assume that the model contains a constant term (therefore, the first column of the design matrix $X$ contains a value of 1). In practice, it would be most unusual for a model not to contain such an intercept. Note that this implies that $x_{Rj}$ and $x_{Di}$ are vectors with a first entry equal to 1 and that all entries in the first column of the design matrix $X$ equal 1.
1.2. Notation and Illustrative Data Examples
The notation used by Taplin and Hunt (2019) for the explanatory variables ($x_{Rj}$ for the review data and $x_{Di}$ for the development data) is problematic since a variable that always contains the value of 1 is not normally considered an explanatory variable. We therefore introduce more intuitive notation, replacing $x_{Rj}$, $x_{Di}$, and $X$ with $e_{Rj}$, $e_{Di}$, and $E$, respectively, when the explanatory variables exclude the 1 for the intercept. For example, $e_{Rj}$ contains the values of the $p$ explanatory variables (excluding the 1 for the intercept) while $x_{Rj}$ is a vector of dimension $p+1$ with an additional value of 1 in front: $x_{Rj} = (1, e_{Rj}^{T})^{T}$. Here we use $e$ and $E$ (e for explanatory variables, in the traditional sense, without a variable of ones) instead of $\chi$ (the Greek letter for x) as the capital letter for $\chi$ is indistinguishable from the capital letter for $x$ (which might introduce confusion between the new notation and the notation in Taplin and Hunt (2019)).
Throughout this paper, we illustrate techniques using the simple (fictitious) development data in Table 1, which contain three explanatory variables: $e_1$, $e_2$, and $e_3$. Some combinations of values appear for more than one observation in this data, while other combinations appear only once. An observation with values $e_1 = a$, $e_2 = b$, and $e_3 = c$ would be denoted as $x = (1, a, b, c)^{T}$ in Taplin and Hunt (2019) and, in our notation, as $e = (a, b, c)^{T}$. The three explanatory variables $e_1$, $e_2$, and $e_3$ have means of 3.12, 2.08, and 1.44, respectively, in this development data. The variables $e_1$ and $e_2$ are highly correlated ($r = 0.91$), with moderate, negative correlations between $e_1$ and $e_3$ ($r = -0.34$) and between $e_2$ and $e_3$ ($r = -0.35$). The model uses a linear combination of these three explanatory variables to predict the response variable (e.g., a logistic regression model to predict default). This development data will typically be the raw observations used to develop the model. For example, for a home loan portfolio, $e_1$, $e_2$, and $e_3$ could represent the loan-to-value ratio (security), the income-to-repayments ratio (serviceability), and the current interest rate (economic conditions).
Together with the development data in Table 1, we considered two different sets of review data. Review data R1 have the same 50 observations as the development data in Table 1 together with an additional 51st observation with $e_1 = 6$, $e_2 = 1$, and $e_3 = 1$. That is, $e_{51} = (6, 1, 1)^{T}$. For the review data R1, the PAI equals 1.31, which Taplin and Hunt (2019) interpret as a deterioration requiring further investigation. The review data R2 are described in Table 2. For this review data, the PAI equals 1.58, which Taplin and Hunt (2019) interpret as a significant deterioration in the predictive accuracy of the model.
These review data are typically not only out-of-sample (i.e., not used to develop the model) but also out-of-time (i.e., relating to a point in time different to the development data). Typically, the review data are later in time (more recent) than the development data. Stress testing models are, by design, intended to be used in situations where some degree of extrapolation is involved, and therefore the PAI and the techniques presented in this paper may be of less value for them. Rather than judging such models against the benchmarks of 1.1 and 1.5 suggested by Taplin and Hunt (2019), in stress testing situations these techniques may be useful to quantify the extent of the extrapolation, or to determine which econometric variables are responsible for most of the extrapolation.
2. The PAI as a Function of the Squared Mahalanobis Distances
The square of the Mahalanobis distance of a vector of explanatory variables $e$ (relative to the distribution with mean $\mu$ and variance–covariance matrix $\Sigma$) equals $M(e) = (e - \mu)^{T}\Sigma^{-1}(e - \mu)$. First introduced by Mahalanobis (1936), this distance is useful for detecting multivariate outliers. A multivariate observation $e$ has $M(e) = 0$ when $e = \mu$, and $M(e)$ is large for an outlier if it deviates a large distance from $\mu$ in a direction in which the standard deviation of the multivariate distribution is relatively small.
While the Mahalanobis distance might be well known to modellers, some might be more familiar with the leverage (typically denoted h) used to determine whether outliers exist in the explanatory variables. Due to the monotonic relationship between the squared Mahalanobis distance and leverage, either is suitable as a scale to examine observations.
Model developers may examine the Mahalanobis distance (or, equivalently, the leverage) of observations in the development data, relative to the distribution (mean and variance–covariance matrix) of the development data. This provides useful information concerning the properties of the development data and which (if any) observations have a high leverage on the estimated parameters. Similarly, model reviewers may examine the Mahalanobis distance of observations in the review data (relative to the distribution of the review data).
Table 3 provides the squared Mahalanobis distance $M(e)$ based on the development data in Table 1. For example, the squared Mahalanobis distance is 1.0 for the seven observations with $e = (3, 2, 1)^{T}$. This distance is small because these observations are close to the mean $(3.12, 2.08, 1.44)^{T}$ of the development data. $M(e)$ increases when the value of any of these variables deviates from the central mean value, especially when $e_1$ is high and $e_2$ is low (or vice versa) due to the high positive correlation between variables $e_1$ and $e_2$.
Note that the squared Mahalanobis distance $M(x)$ is not strictly defined when the explanatory variables are defined as in Taplin and Hunt (2019) because the data contain a variable always equal to 1; since this "variable" has a variance of 0, the variance–covariance matrix would not be invertible. Finally, note that, in defining the variance–covariance matrix $\Sigma$ for the explanatory variables, we used the definition of variance and covariances for a population (not a sample). For example, the variance of an explanatory variable $x$ (with $n$ values) is defined as $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (not using $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$). Using this simple mean of squared deviations simplifies mathematical expressions and makes a negligible difference to numerical values for a realistically large sample (large $n$) of development data.
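A minimal sketch of this convention (our illustration, not code from the paper): squared Mahalanobis distances relative to the development data, using the population (divide-by-$n$) covariance.

```python
import numpy as np

def sq_mahalanobis(e: np.ndarray, e_dev: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distances of the rows of e, relative to the
    mean and population covariance of the development data e_dev."""
    mu = e_dev.mean(axis=0)
    sigma = np.cov(e_dev, rowvar=False, ddof=0)  # population (divide-by-n) covariance
    diff = e - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
```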
An Expression for the PAI as a Function of the Squared Mahalanobis Distance
The PAI in Equation (1) can be expressed in terms of the squared Mahalanobis distances as follows (see Appendix A for the derivation):

$$\mathrm{PAI} = \frac{1 + \bar{M}_R}{1 + \bar{M}_D}, \qquad (2)$$

where $\bar{M}_R$ equals the average of the squared Mahalanobis distances of the review data, $\bar{M}_D$ equals the average of the squared Mahalanobis distances of the development data, and all Mahalanobis distances are relative to the mean and variance–covariance matrix of the development data. That is, the squared Mahalanobis distance of observation $e$ (either $e_{Rj}$ for review data or $e_{Di}$ for development data) is $M(e) = (e - \bar{e}_D)^{T}\Sigma^{-1}(e - \bar{e}_D)$, where $\bar{e}_D$ contains the mean of the explanatory variables at development and $\Sigma$ is the variance–covariance matrix of the explanatory variables at development. If the formula for the sample variance–covariance matrix ($n-1$ instead of $n$) were used for $\Sigma$, the value of 1 in both the numerator and denominator of Equation (2) would need to be replaced with $(n-1)/n$.
For example, using the development data in Table 1, the denominator of Equation (2) equals 3.00 because the average of the squared Mahalanobis distances in the development data equals 2.00. This average of 2.00 is the weighted average of the values in Table 3 (using the counts in Table 1 as weights). The numerator of Equation (2) equals the corresponding average of squared Mahalanobis distances for the review data. For example, the 51st observation $e_{51} = (6, 1, 1)^{T}$ in review data R1 has a very large squared Mahalanobis distance (see Table 3). This increases the average squared Mahalanobis distance from 2.00 in the development data to 3.23 in the review data. Thus, from Equation (2), the value of the PAI for the review data R1 is 1.31. In this simple example, the reason for this deterioration is the additional observation $e_{51}$, which is an outlier relative to the development data.
For review data R2, the average squared Mahalanobis distance is 5.32; therefore, from Equation (2), the PAI is 1.58.
Equation (2) is a novel use of the Mahalanobis distance because both the review data (numerator) and development data (denominator) use the distribution of the development data (therefore, the numerator uses the mean and variance–covariance matrix of the development data even though the distances are calculated for the review data). This differs from the more routine use of the Mahalanobis distance (such as those produced in standard software), where observations are compared to the dataset they come from (observations in the development data are compared to the distribution of the development data and observations in the review data are compared to the distribution of the review data).
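A minimal sketch of Equation (2), reusing the hypothetical sq_mahalanobis() helper sketched above:

```python
def pai_from_distances(e_dev, e_rev) -> float:
    """Equation (2): both averages use the development mean and covariance."""
    m_dev = sq_mahalanobis(e_dev, e_dev).mean()
    m_rev = sq_mahalanobis(e_rev, e_dev).mean()  # review data, development distribution
    return (1.0 + m_rev) / (1.0 + m_dev)
```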
3. The Contribution of Individual Observations to the PAI
Equation (2) suggests that a large PAI is due to review observations with a large squared Mahalanobis distance (relative to the average squared Mahalanobis distance at development). Therefore, when further investigation is required (PAI > 1.1), it is prudent to examine the Mahalanobis distances of all the review observations in case the model's deterioration is due to one or several observations. This situation was illustrated using review data R1 in the previous section, where all the excess PAI above 1 was due to one outlier (by construction). The next highest squared Mahalanobis distance among the review observations is 8.8.
Rather than examining the squared Mahalanobis distance of an observation, we recommend, in the context of the PAI, an alternative scale that is defined by the effect on the PAI if an observation is removed. Thus, we define $D_j$, the influence of review observation $j$ on the PAI, as

$$D_j = \mathrm{PAI} - \mathrm{PAI}_{(j)}, \qquad (3)$$

where PAI is the PAI using all $m$ observations of the review data ($e_{Rj}$; $j = 1$ to $m$) and $\mathrm{PAI}_{(j)}$ is the value of the PAI after removing the observation $e_{Rj}$. The following properties of $D_j$ also assist with its interpretation (see Appendix B for derivations); a computational sketch follows the list:
- (a) $D_j$ can be either positive or negative;
- (b) The average of the $D_j$ is 0;
- (c) A useful reference point for a small $D_j$ is 0, since, while $D_j$ can be negative, the PAI does not change if an observation with $D_j = 0$ is removed;
- (d) A useful reference point for a large $D_j$ is $\mathrm{PAI} - 1$, since the PAI is reduced to 1 when an observation with $D_j = \mathrm{PAI} - 1$ is removed;
- (e) $D_j$ can be written in terms of the squared Mahalanobis distance $M_j$ of observation $e_{Rj}$:

$$D_j = \frac{M_j - \bar{M}_R}{(m-1)(1 + \bar{M}_D)}.$$
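The influence values can also be computed directly from their definition by leave-one-out, as in the following sketch (our illustration, reusing the hypothetical pai_from_distances() helper from Section 2):

```python
import numpy as np

def influences(e_dev: np.ndarray, e_rev: np.ndarray) -> np.ndarray:
    """D_j = PAI - PAI_(j): change in the PAI when review observation j is removed."""
    pai_all = pai_from_distances(e_dev, e_rev)
    d = np.empty(len(e_rev))
    for j in range(len(e_rev)):
        d[j] = pai_all - pai_from_distances(e_dev, np.delete(e_rev, j, axis=0))
    return d  # values near PAI - 1 flag single observations driving the PAI
```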
For example, removing the outlier $e_{51} = (6, 1, 1)^{T}$ from the review data R1 results in a value of the PAI equal to 1.0 and $D_{51} = 0.31$. Thus, this one observation accounts for 100% of the amount by which the PAI of 1.31 exceeds 1.0. For review data R2, each of the three observations in the bottom right of Table 2 has a value of $D_j$ equal to 0.02 (the PAI decreases from 1.58 to 1.56 if one of these observations is removed; therefore, each of these observations accounts for only 0.02/0.58 = 3% of the amount the original PAI exceeds 1). In contrast, the observations with $e = (3, 2, 2)^{T}$ have negative values of $D_j$ because, for these observations, the squared Mahalanobis distance of 1.4 (Table 3) is less than the average of 2.0 for the development data. The distribution of $D_j$ values for all the observations in the review data R2 suggests that the large PAI value for R2 is not due to one or a small number of observations.
Furthermore, a large PAI may be due to a subset of observations with large Mahalanobis distances rather than just one or a few outliers. In this case, it may be useful to examine the squared Mahalanobis distance for subsets of the review data. For example, even if a variable such as the gender of the applicant is not used in the model, the model may predict some of these subsets better than others. These distances can be examined using the review data and development data, or by assessing whether they have changed from development to review. For example, Table 4 shows the average squared Mahalanobis distance for males and females at development and at review, as well as the PAI for these subsets. While these averages suggest that the model has slightly inferior accuracy when predicting females than males at development, the difference is considerably higher at review. The resulting large PAI value of 1.31 for females suggests that the model's predictive ability has deteriorated for female applicants.
The subset variable gender in Table 4 can be any variable (irrespective of whether it is an input of the model). Furthermore, if the value of 4.5 at development for females had instead been 5.9, then the model would be considered stable (for females, PAI = 6.2/5.9 = 1.05). In this situation, the model is considerably less accurate for females than for males at development; however, the model's predictive ability has not deteriorated from development to review.
4. The Proportion of the PAI Due to a Shift in Distributions
A simple way for the distribution of the explanatory variables to change from development data to review data is for the distribution to shift by a constant (changing the mean). For example, the distribution of the loan-to-value ratio (LVR) in a property portfolio might increase if property values all fall in a recession. This section therefore investigates how much of a high PAI value is due to a shift in the mean of one or more explanatory variables. From Equation (2), the PAI only depends on the review data through the mean squared Mahalanobis distance $\bar{M}_R$. Furthermore, $\bar{M}_R$ can be decomposed as follows (to prove this result, write $e_{Rj} - \bar{e}_D$ as $(e_{Rj} - \bar{e}_R) + (\bar{e}_R - \bar{e}_D)$ and then multiply out the quadratic form in the definition of $\bar{M}_R$ and simplify):

$$\bar{M}_R = \frac{1}{m}\sum_{j=1}^{m}(e_{Rj} - \bar{e}_R)^{T}\Sigma^{-1}(e_{Rj} - \bar{e}_R) + (\bar{e}_R - \bar{e}_D)^{T}\Sigma^{-1}(\bar{e}_R - \bar{e}_D), \qquad (4)$$

where the right-hand side can be interpreted as the sum of two components:
- (1) $\frac{1}{m}\sum_{j=1}^{m}(e_{Rj} - \bar{e}_R)^{T}\Sigma^{-1}(e_{Rj} - \bar{e}_R)$, the mean squared Mahalanobis distance of the review data from the mean of the review data;
- (2) $(\bar{e}_R - \bar{e}_D)^{T}\Sigma^{-1}(\bar{e}_R - \bar{e}_D)$, the squared Mahalanobis distance between the mean of the review data and the mean of the development data.
Interpretability is enhanced if the second component is compared to $\bar{M}_R - \bar{M}_D$ instead of comparing the second component to the first component (or their sum, $\bar{M}_R$) because, from Equation (2), it is the amount that $\bar{M}_R$ exceeds $\bar{M}_D$ that produces a large PAI. We therefore define the contribution to the PAI due to the difference in the means of the explanatory variables at review ($\bar{e}_R$) compared to at development ($\bar{e}_D$) as:

$$C = \frac{(\bar{e}_R - \bar{e}_D)^{T}\Sigma^{-1}(\bar{e}_R - \bar{e}_D)}{\bar{M}_R - \bar{M}_D}. \qquad (5)$$

Note that the two components in Equation (4) are both quadratic forms and thus cannot be negative; therefore, $C$ varies from 0 (when the second component equals 0; $\bar{e}_R = \bar{e}_D$) to 1 (when the difference in the means of the explanatory variables from development to review explains all of the amount the PAI exceeds 1). Furthermore, $C$ only applies when the PAI exceeds 1 (i.e., $\bar{M}_R > \bar{M}_D$): if $\mathrm{PAI} \le 1$, then the model does not perform less accurately on the review data than on the development data, and therefore there is no need to explore how much of the PAI is due to a shift in the means of the explanatory variables.
An alternative equation for $C$ is obtained as follows (see Appendix C for a derivation):

$$C = \frac{\mathrm{PAI} - \mathrm{PAI}_{\mathrm{shift}}}{\mathrm{PAI} - 1}, \qquad (6)$$

where PAI is the Prediction Accuracy Index for the original review data $e_{Rj}$ and $\mathrm{PAI}_{\mathrm{shift}}$ is the Prediction Accuracy Index using the transformed review data $e_{Rj} - (\bar{e}_R - \bar{e}_D)$, so that the distribution of the review data remains intact other than shifting the mean to coincide with the mean of the development data. Equation (6) has a minor advantage over Equation (5) in that any software that can calculate the PAI can be used to calculate $C$ without the need to calculate any Mahalanobis distances.
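A minimal sketch of Equation (6) via the shifted review data (our illustration, again assuming the hypothetical pai_from_distances() helper from Section 2):

```python
def shift_contribution(e_dev, e_rev) -> float:
    """Equation (6): proportion of the excess PAI due to a shift in the means.
    Only meaningful when PAI > 1."""
    pai = pai_from_distances(e_dev, e_rev)
    # Shift the review data so its mean coincides with the development mean.
    shifted = e_rev - (e_rev.mean(axis=0) - e_dev.mean(axis=0))
    return (pai - pai_from_distances(e_dev, shifted)) / (pai - 1.0)
```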
For example, for review data R1 with the additional observation $e_{51} = (6, 1, 1)^{T}$, $\mathrm{PAI} = 1.31$ and $\mathrm{PAI}_{\mathrm{shift}} = 1.30$. This extra observation changes the mean of the explanatory variables from $(3.12, 2.08, 1.44)^{T}$ to $(3.18, 2.06, 1.43)^{T}$; however, transforming the review data to have the same mean as the development data only reduces the PAI by 0.006. Hence, the change in the mean of the variables contributes only $C \approx 2\%$ of the 0.31 by which the PAI exceeds the baseline value of 1. As expected from the discussion in Section 3, the large PAI in this example is not due to a shift in the means of the explanatory variables but due to the single outlier in the review data.
For the review data R2, where the PAI equals 1.58, the means of the three explanatory variables in the review data are $(3.82, 2.02, 1.44)^{T}$; therefore, the difference relative to development is $\bar{e}_R - \bar{e}_D = (0.70, -0.06, 0.00)^{T}$. Subtracting this vector from each of the review observations produces a value of $\mathrm{PAI}_{\mathrm{shift}} = 1.13$ for the transformed review data. Hence, from Equation (6), the proportion of the excess PAI due exclusively to a change in the means of the explanatory variables is $C = (1.58 - 1.13)/(1.58 - 1) = 0.77$. That is, 77% of the amount by which the PAI of 1.58 exceeds the value of 1 (the value when the model performs equally accurately on review and development data) is due to the shift in the mean of the review data relative to the development data.
The difference in means $(0.70, -0.06, 0.00)^{T}$ suggests that this is primarily due to an increase in the mean of the first explanatory variable from 3.12 to 3.82; however, since the explanatory variables can be measured in different units, it is recommended that the vector $\bar{e}_R - \bar{e}_D$ be standardised by dividing by the corresponding standard deviations. For the development data in Table 1, the standard deviations of the explanatory variables are 1.51, 0.78, and 0.50; therefore, the standardised differences in means are, respectively, 0.70/1.51 = 0.46, −0.06/0.78 = −0.08, and 0.00/0.50 = 0.
The conclusion that the high value of the PAI for review data R2 is primarily due to a shift in the mean of the first explanatory variable is also evident by applying Equation (6) after transforming the review data so that the mean of only one explanatory variable is changed. For example, if the review data are transformed to $e_{Rj} - (0.70, 0, 0)^{T}$, so that the mean of the first explanatory variable equals the mean in the development data but the means of the other two explanatory variables are not changed, then the PAI becomes 1.14. Hence, the contribution from a shift in the mean of the first explanatory variable is $C_1 = (1.58 - 1.14)/(1.58 - 1) = 0.76$. That is, the shift in the mean of the first explanatory variable alone explains around three-quarters of the excess PAI. Alternatively, if the mean of the second explanatory variable is shifted by transforming the review data to $e_{Rj} + (0, 0.06, 0)^{T}$, then the PAI equals 1.47 and $C_2 = (1.58 - 1.47)/(1.58 - 1) = 0.19$. For the third explanatory variable, the means in development and review are equal; therefore, $C_3 = 0$.
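The per-variable calculation above can be sketched the same way (again our illustration, not the paper's code): shift only the mean of variable k back to its development value and recompute the PAI.

```python
import numpy as np

def shift_contribution_single(e_dev, e_rev, k: int) -> float:
    """Contribution C_k from the shift in the mean of variable k alone."""
    pai = pai_from_distances(e_dev, e_rev)
    delta = np.zeros(e_rev.shape[1])
    delta[k] = e_rev[:, k].mean() - e_dev[:, k].mean()  # undo the shift in variable k
    return (pai - pai_from_distances(e_dev, e_rev - delta)) / (pai - 1.0)
```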
5. The Contributions of Explanatory Variables to the PAI
Taplin and Hunt (2019) recommended using the univariate PAI to examine which variables contribute to high values of the multivariate PAI; however, this does not always produce useful insights. For example, with review data R2 (PAI = 1.58), the univariate PAIs for the three variables $e_1$, $e_2$, and $e_3$ are, respectively, 1.09, 0.99, and 1.00. In all cases these are less than 1.1 and thus are interpreted by Taplin and Hunt (2019) as indicating no significant deterioration. Thus, the univariate PAI fails to diagnose that the large PAI value is caused by a change in the distribution of the first variable $e_1$ (as discussed in the previous section).
One reason for the univariate PAI not being informative is that models typically include many variables, but the univariate PAI summarises the stability of a model as if the model contained only one variable. This single-variable model is most likely very different to the model of interest, and hence less relevant to the multivariate PAI. We therefore recommend calculating the PAI with one variable removed from the model variables rather than the PAI that includes only one variable, because removing one variable is a smaller change than removing all but one variable. There is no need to re-estimate the model: the PAI is simply recalculated using the data without the specified variable.
For the review data R2 (Table 2), the PAI is 1.01 if the first variable is removed and the PAI is calculated on the remaining variables $e_2$ and $e_3$. Compared to the original value of 1.58, this represents a decrease of (1.58 − 1.01)/(1.58 − 1) = 98% of the amount by which the original PAI exceeded 1. Thus, unlike the univariate PAI of 1.09, comparing the multivariate PAI with and without the first variable $e_1$ suggests that the high PAI is due to the variable $e_1$. When the multivariate PAI is calculated without the second variable $e_2$, the PAI equals 1.09, and it equals 1.75 if the third variable $e_3$ is removed. When the second variable is removed, the PAI is small because each of the possible combinations of the other two variables appears in the development data (therefore there is no large extrapolation to predict any of the review data). When the third variable is removed, the PAI is very high because the high correlation between $e_1$ and $e_2$ in the development data means that the six review observations deviating from this strong positive relationship are outliers relative to the development data.
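A minimal sketch of this leave-one-variable-out diagnostic (our illustration; columns are dropped from the data and the PAI recomputed, with no model re-estimation):

```python
def pai_without(e_dev, e_rev, drop) -> float:
    """Multivariate PAI after removing the columns listed in `drop`."""
    keep = [c for c in range(e_dev.shape[1]) if c not in set(drop)]
    return pai_from_distances(e_dev[:, keep], e_rev[:, keep])

# e.g., pai_without(e_dev, e_rev, drop=[0]) removes the first explanatory variable;
# drop can also list all dummy columns of one categorical variable.
```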
We can draw two conclusions from these examples. First, rather than following the advice in Taplin and Hunt (2019) to use the univariate PAI to investigate which explanatory variables contribute to a large PAI, we recommend calculating the PAI after removing a variable (and calculating the PAI based on all the other variables). In some cases, it may be appropriate to remove more than one variable from the model. Examples include removing all the dummy variables for a categorical variable (see Section 6.1), removing both a variable and its square for a quadratic relationship (and then calculating the PAI using the remaining variables), and removing all cross-terms for interaction effects.
Second, we note that the PAI can change considerably depending on which explanatory variables are included; therefore, it may be worth including variables not in the model when calculating the PAI. This advice is consistent with Taplin and Hunt (2019, p. 10): "…we recommend calculating the MPAI using only the variables in the model or using these and a few other variables considered important." For example, suppose the development data (Table 1) produced a model with only $e_2$ and $e_3$ as explanatory variables (the first variable $e_1$ was not considered to have predictive power). Then, for the variables in this model, the PAI for review data R1 is 1.01, and the model is stable (the predictive accuracy of the model at review is similar to the predictive accuracy at development). However, if the variable $e_1$ was considered during the development of the model, its exclusion from the model could equivalently be considered as a model in which this variable is included but with a coefficient equal to 0 (or insignificantly different to 0). From this perspective, the PAI of 1.31 (including all three variables) is relevant in the sense that the coefficient of (approximately) 0 for variable $e_1$ might not provide accurate predictions for the outlier observation $e_{51} = (6, 1, 1)^{T}$ in the review data. That is, when the entire model development is considered (rather than just the variables included in the final model), there is a compelling argument that the PAI should be calculated using more variables than just those in the final model. Thus, we suggest that the PAI also be calculated using all variables considered for inclusion in a model. This is important because the accuracy of predictions from the final model depends not only on the coefficients of the variables in the model but also on the choice of which variables are included in the final model. Model stability is arguably relevant for the entire modelling process (not just the final model).
6. Discussion
Both the use of categorical variables and the use of econometric variables in credit models deserve further discussion. Categorical variables are commonly used, highlight some of the conclusions above, and are sometimes constructed from numerical variables. For example, a numeric variable such as an applicant's age (in years) might be converted into a categorical variable with several categories (e.g., young, 18–30; middle-aged, 31–50; and old, 51+). Econometric variables present their own challenges to model stability.
6.1. Categorical Variables
First, when considering the impact of a categorical variable on the PAI, it may be more logical to exclude all the dummy variables for that categorical variable rather than exclude them one at a time. Second, not only is a shift in the mean of a categorical variable difficult to define, but a reasonable definition involving the proportion of observations in each category amounts to describing all the possible changes in the distribution of that variable. Thus, examining shifts in the mean of a categorical variable is equivalent to examining the effect of excluding the variable.
Third, many modelling techniques and practices involve converting numeric variables into categorical variables. Examples include transforming variables prior to a logistic regression and conversions implied by the modelling approach itself, such as a classification tree. For example, the three age groups (young, 18–30; middle-aged, 31–50; and old, 51+) might form the variable in the final model while age in years (18, 19, 20, …) is available in the development data. In these situations, it may be prudent to calculate the PAI using the original numeric variable instead of the categories. This will highlight whether there has been a shift in the original age distribution that is not evident from the categories (for example, if the old applicants in the development data were all younger than 55 years old but in the review data they are all over 55 years old).
Similarly, where a categorical variable (e.g., industry codes) is formed by combining categories, it may be prudent to calculate the PAI using the original (larger) number of categories. This is because changes in the distribution of a categorical variable might be hidden by the combination of a larger number of categories into a smaller number, just as combining many numeric values into the same category can. It is of interest to investigate the stability of the whole modelling process, not just the stability of the variables in the form they take in the final model.
6.2. Econometric Variables
Credit models, such as probability of default (PD) models, may contain not only explanatory variables defined by the characteristics of a borrower but also characteristics of economic conditions. For example, under IFRS 9, models use econometric variables such as interest rates or unemployment rates (or recent changes in these rates) to predict future default rates. This enables models to capture changes in default rates through the economic cycle, with the expectation that they will provide more accurate predictions given the current (or forecast future) economic conditions. This is justified by the expectation that response variables such as default rates are influenced by economic conditions.
However, these econometric variables can be problematic for many reasons. In the context of model stability, econometric variables are likely to take the same values throughout the review data despite varying in the development data. For example, when examining whether a PD model is still fit-for-purpose today, model reviewers may examine the current portfolio, but all these observations share the same values of the econometric variables because they are observed at the same point in time. This distribution, with little or no variation, will look markedly different to the distribution in the development data (which presumably shows considerable variation, or the econometric variables are unlikely to be significant predictors). Indeed, it would be surprising if the distribution of an econometric variable at review were similar to the distribution at development. While the PAI might be less problematic than statistics such as the PSI, because the PAI measures the extent of extrapolation rather than any difference in the distributions, the PAI is still likely to be high unless the values of the econometric variables at review are near the middle of the distribution at development.
For example, suppose that the three variables in the development data (Table 1) are all econometric variables and we wish to compare the distribution of these variables at review to the distribution at development. If the review data are observed at the most recent time period, then it is likely that all these observations have the same values of the econometric variables. We can therefore ask: for which values of these econometric variables will the PAI be greater than 1.1? Since the average squared Mahalanobis distance at development is 2.0, from Equation (2), the squared Mahalanobis distance must be less than 2.3 if the PAI is to be less than 1.1. This only occurs for three values of the explanatory variables: $(3, 2, 1)^{T}$, $(3, 2, 2)^{T}$, and $(3, 2, 2)^{T}$, where the squared Mahalanobis distances are 1.0, 1.4, and 1.8, respectively (Table 3). These occur a total of 16 out of 50 times in the development data; therefore, there is only a 32% chance that the PAI will be green (<1.1) even if the review data are selected from the distribution of the development data. All other values of the explanatory variables at review result in a PAI greater than 1.1. For the PAI to be less than 1.5, the squared Mahalanobis distance must be less than 3.5, which adds the combinations $(5, 3, 1)^{T}$ and $(1, 1, 2)^{T}$, which occur six times in the development data. Hence, only 44% of the observations in the review data result in a PAI less than 1.5. Thus, even if the values of the explanatory variables at review were randomly selected from the distribution in the development data, there is a 56% chance the PAI is red (>1.5) and only a 32% chance it is green (<1.1).
One response to this characteristic of the PAI (that it tends to be large for models that include econometric variables) is to calculate the PAI without the econometric variables (only using the non-econometric variables). Another is to change the cut-offs of 1.1 (amber) and 1.5 (red) recommended by Taplin and Hunt (2019) when econometric variables are included. We do not support these modifications. Instead, we discuss why the large PAI might accurately reflect the instability of the model and how this may be a symptom of inappropriate modelling.
One reason why the PAI can diagnose models with econometric variables as being unstable is over-confidence in the accuracy of these models. This can be due to a phenomenon referred to as pseudoreplication, which occurs when observations are treated as being statistically independent when they are not. Consider, for example, default rates for accounts: if the same customer owns several accounts, then it is likely that they default together (not independently), because if a customer is in default on one account, it is likely they are (or will be determined by a bank to be) in default on all their loans. When modelling with econometric variables, it is equally important to recognise that observations measured at the same point in time will not be independent: they are likely to have several characteristics in common, such as unobserved economic conditions. Simple logistic regression models will not capture this dependence structure in the data and will consequently over-estimate the precision of the regression coefficients. For example, consider the development data in Table 1 with 50 observations across 14 combinations of the three explanatory variables. If all these variables are econometric, it is likely that these data arise from 14 points in time, and it is more realistic, in terms of the sampling of the econometric variables, to view the sample size as closer to 14 than to 50. In practical applications, this discrepancy in sample sizes is likely to be much larger (with thousands or tens of thousands of observations across only dozens of points in time). Hurlbert (1984) discussed the problem of pseudoreplication in ecology, while a more recent discussion of the phenomenon in business can be found in Petersen (2009).
This problem of pseudoreplication can be addressed during model development (for example, using random effects models or the robust standard errors advocated by Petersen (2009)), albeit with several consequences. First, fewer (if any) econometric variables will be statistically significant and included in a final model. This is problematic for model developers who are required to follow IFRS 9 (which implies that these variables must be included). Second, the higher standard deviations for the estimated parameters of the econometric variables will change the matrix $V$ in Equation (1) for the PAI, and this will effectively downweight the impact of the econometric variables on the PAI relative to the other explanatory variables. Essentially, with regard to Equation (2) for the PAI, distances involving econometric variables will become lower. Thus, correctly modelling the development data to avoid pseudoreplication will not only produce models that more accurately capture econometric effects but will also at least partially correct the interpretation of the PAI through the use of a more realistic variance–covariance matrix $V$ for the estimated model parameters. That is, a model that has a high PAI due to econometric variables might identify inadequacies in the modelling of the development data. Rather than excluding these econometric variables from the calculation of the PAI, it might be prudent to examine whether the model development was inappropriate due to pseudoreplication.
A quick, simple way to avoid pseudoreplication with econometric variables is to develop a PD model with a two-stage process (a rough sketch appears below). First, build a model with non-econometric variables (possibly with dummy variables for different points in time to capture econometric effects). Then, aggregate the data to overall default rates at each point in time and model the overall default rate at each point in time using econometric variables and the average predicted PD at that time as a covariate. Note that, if the development data consist of 1000 loans at each of 20 points in time, then the first model will use 20,000 observations while the second will use 20 observations. This approach may be inappropriately simplistic when building a model but might be a quick approach for model reviewers who do not wish to develop a model but rather explore whether pseudoreplication exists in the development data. This is not an unreasonable investigation for a model review, especially if it is suspected that the model development process ignored the presence of pseudoreplication.
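A rough sketch of this two-stage idea on fabricated data (all names and data are ours, under many simplifying assumptions; a real application would use the portfolio's loan-level data and observed defaults):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
n_times, n_loans = 20, 1000
t = np.repeat(np.arange(n_times), n_loans)          # time index of each loan
loan_x = rng.normal(size=(n_times * n_loans, 2))    # borrower characteristics
rate = rng.normal(size=n_times)                     # one econometric series
y = rng.binomial(1, 0.05, size=n_times * n_loans)   # fabricated default flags

# Stage 1: loan-level PD model using non-econometric variables only.
stage1 = LogisticRegression().fit(loan_x, y)
pd_hat = stage1.predict_proba(loan_x)[:, 1]

# Stage 2: aggregate to one observation per time point (20, not 20,000).
dr_t = np.array([y[t == s].mean() for s in range(n_times)])       # default rate
pd_t = np.array([pd_hat[t == s].mean() for s in range(n_times)])  # avg stage-1 PD
stage2 = LinearRegression().fit(np.column_stack([rate, pd_t]), dr_t)
```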
However, the possibility that models containing econometric variables are intrinsically unstable cannot be ignored. For example, at the time of writing (early 2023), interest rates in most countries have increased dramatically from the economic stimulus conditions that followed COVID-19; historical data showing such a change in economic conditions are so rare that any model developed on historical data is likely to demonstrate low stability.
7. Conclusions
The Prediction Accuracy Index (PAI) is a useful statistic to detect when the stability of a model has deteriorated; however, the literature provides little guidance on how to investigate the reasons for a lack of model stability when the PAI is high. In particular, the suggestion by Taplin and Hunt (2019) to use the univariate PAI to examine which variable(s) contribute to a high multivariate PAI appears simplistic and problematic because it essentially ignores all the other explanatory variables.
This paper has explored the properties of the multivariate PAI, leading to recommended approaches for examining whether a large PAI is due to individual observations, individual explanatory variables, or a shift in the mean of the explanatory variables. This includes the case where several explanatory variables are closely related (such as multiple dummy variables created from one categorical variable). This has several implications for how models should be reviewed, especially when the value of the PAI is high. An important instance is when, following IFRS 9 for provisioning, econometric variables are explicitly included in model development. This practice is very ambitious (especially compared to the Basel Accord for capital) and may lead to models that fail stability at review due to inadequate modelling practices. This has important implications not only for model developers but also for model reviewers, auditors, regulators, and standard setters.