1. Introduction
In this paper, we propose a multi-desk extension of the spectral backtest of
Gordy and McNeil (
2020). Our test is designed to address the requirement that banks implement backtests of their risk models at the desk level. Although banks are required to report the results of single-desk value-at-risk backtests, our proposal is for a test that simultaneously assesses the quality of all desk-level risk models with respect to outcomes in the tail.
There are a number of potential advantages to our test. First, it may detect deficiencies in desk models that are missed when trading desk data are aggregated into a single portfolio. Aggregation results in a loss of information, and the netting effect across desks can mask situations where the under-estimation of risk in one trading desk model is compensated for by the over-estimation of risk in another. Moreover, when desks are tested jointly, we effectively increase the amount of data relative to a single test at the level of the trading book, which improves testing power. Furthermore, by embedding our test in the spectral framework, we are able to test the performance of risk models at a range of confidence levels, and not simply at a single level, as in standard VaR backtesting. To place our proposal in context, we give a brief review of the development of backtesting for the trading book in the following paragraphs.
Over the last three decades, a method of testing value-at-risk (VaR) predictions using historical data, known as backtesting (
Jorion 2007), has become the industry standard for the validation of internal models of market risk. Under the previous regulatory frameworks of Basel II and Basel II.5 (see
BCBS 2006,
2011), backtesting required the comparison of one-day-ahead VaR forecasts at the 99% probability level with the ex-post realised losses and was based on so-called VaR violation indicators. Most proposed backtests for the accuracy of VaR forecasts are of a univariate nature and are based on univariate time series of VaR violations for a single portfolio, such as the entire trading book of a bank.
Christoffersen (
1998) defined criteria, known as the unconditional coverage and independence hypotheses, that should be satisfied by credible VaR forecasts. The former is the requirement that the expected number of violations of VaR forecasts at probability level α over n time periods should be n(1 − α), while the latter is the requirement that such violations should occur independently in time. The criteria can also be combined to obtain the conditional coverage hypothesis. An early approach to verifying the unconditional coverage hypothesis can be found in
Kupiec (
1995), while an approach to testing for the independence of the sequence of VaR violation indicators was presented by
Christoffersen (
1998). Tests of the independence hypothesis generally require an assumption about the dependence structure under the alternative hypothesis and can have very limited power to detect departures from independence caused by other forms of serial dependence. Moreover, as pointed out in
Campbell (
2007), joint tests of the unconditional coverage and independence hypotheses are not automatically preferable to separate testing; poorly performing VaR models that violate only one of the two hypotheses are less likely to be detected by a joint test than by two separate tests.
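To make the unconditional coverage criterion concrete, the Kupiec likelihood-ratio statistic can be sketched in a few lines of code. This is a minimal illustration in Python; the function name and interface are ours, not taken from any of the cited papers.

```python
import math

def kupiec_lr(n, x, p):
    """Kupiec (1995) unconditional-coverage LR statistic.

    n: number of backtest days; x: number of observed VaR violations;
    p: expected violation probability (e.g. 0.01 for 99% VaR).
    Under the null hypothesis, the statistic is asymptotically
    chi-squared distributed with 1 degree of freedom.
    """
    if x == 0:
        return -2.0 * n * math.log(1.0 - p)
    if x == n:
        return -2.0 * n * math.log(p)
    phat = x / n  # observed violation frequency
    log_h0 = (n - x) * math.log(1.0 - p) + x * math.log(p)
    log_h1 = (n - x) * math.log(1.0 - phat) + x * math.log(phat)
    return -2.0 * (log_h0 - log_h1)
```

For example, with 250 days and exactly 10 observed violations, the statistic is zero when the expected violation probability is 4%, while the same count against a 1% level yields a statistic well above the 3.84 critical value of the chi-squared distribution with one degree of freedom.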
Since a reasonable model of a profit-and-loss (P&L) distribution should deliver acceptable VaR estimates at a range of probability levels, the revised regulatory requirements for the measurement of market risk (
BCBS 2019)
1 emphasise that backtesting should be carried out at multiple probability levels beyond the usual 99% level. In these new regulations, the expected shortfall risk measure plays an important role in the formula used to determine the required capital for market risk in the trading book. Since the expected shortfall is defined as the mean of the losses that are greater than the VaR at level α, a good estimate of the expected shortfall requires a P&L forecast that is accurate for VaR estimation at a range of α values in the tail of the distribution.
Kratz et al. (
2018) proposed a simultaneous multinomial test of VaR estimates at different α levels and suggested that such a test could be viewed as an implicit backtest of the expected shortfall. Alternative multilevel VaR tests were also developed by
Campbell (
2007), who proposed a Pearson’s chi-squared test for goodness of fit, and
Pérignon and Smith (
2008), who developed a likelihood-ratio test generalising the unconditional coverage test of
Kupiec (
1995).
Testing the accuracy of VaR forecasts at more than one level implies that we move away from a simple assessment of the validity of VaR to a more thorough assessment of the forecast of the P&L distribution from which the VaR forecast is calculated. If the P&L distribution is adequately estimated, the resulting VaR estimates must be accurate for every level in (0, 1). In the extreme case that we test at every level, we backtest the complete P&L distribution. One advantage of backtesting the forecast of the P&L distribution (or a region thereof) is that we exploit much more information than is contained in the series of VaR violation indicators at a single level. The latter may take only the values one or zero, depending on whether a VaR violation has occurred, and, for typical levels in the region of 99%, violations are rare and the resulting indicator data are sparse.
Backtests of the forecast of the P&L distribution can be based on realised probability-integral transform (PIT) values. These are transformations of the realised value of P&L by the cumulative distribution function of the model used to forecast P&L at the previous time point. An ideal forecaster, i.e., a forecaster in the sense of
Gneiting et al. (
2007) who has knowledge of the correct model, would produce independent and uniformly distributed PIT values.
Berkowitz (
2001) proposed the first backtest of this kind based on the transformation of realised PIT values to a normal distribution under the null hypothesis.
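The mechanics of PIT values are easy to illustrate: when the forecaster's distribution function coincides with the true P&L distribution, applying it to each realisation gives uniformly distributed values. The following is a minimal Python sketch under that assumption; the variable names are ours.

```python
import random
import statistics

rng = random.Random(42)
nd = statistics.NormalDist()  # the forecast model used every day

# The true P&L is also standard normal, so the forecaster is ideal:
# the PIT values F(realisation) should be approximately U(0, 1).
pit = [nd.cdf(rng.gauss(0.0, 1.0)) for _ in range(10_000)]

mean = sum(pit) / len(pit)                            # close to 1/2
var = sum((p - mean) ** 2 for p in pit) / len(pit)    # close to 1/12
```

A misspecified forecaster would instead produce PIT values whose distribution deviates from uniform, which is precisely what backtests of this kind are designed to detect.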
The spectral tests of
Gordy and McNeil (
2020) are constructed by transforming PIT values with a weighting function, referred to as a kernel, and are available in both unconditional coverage and conditional coverage variants. Their form is very flexible and they include many of the previously proposed approaches to backtesting as special cases. Under the spectral backtesting philosophy, the risk modelling group or banking regulator can choose the kernel to apply weight to the area of the forecast distribution that is of primary interest—for example, a region in the area around the 99% quantile. Among the tests subsumed in this framework are the test of
Kupiec (
1995), which corresponds to a kernel equivalent to the Dirac measure at a specific probability level α
, and the test of
Berkowitz (
2001). The spectral risk measure test of
Costanzino and Curran (
2015) and the expected shortfall test of
Du and Escanciano (
2017) are further special cases obtained by choosing a kernel truncated to tail probabilities.
While the test of
Gordy and McNeil (
2020) can be viewed as an absolute test of model adequacy for a single candidate model, a weighted approach to forecast comparison in specific areas of a forecast distribution was proposed by
Amisano and Giacomini (
2007). Their weighted likelihood ratio test is a relative test that compares the performance of two competing density forecasts and is based on the weighted averages of the logarithmic scoring rule, where the weights are chosen according to the preferences of the risk modelling group.
Diks et al. (
2011) proposed similar tests but based them on conditional likelihood and censored likelihood scoring rules, while
Gneiting and Ranjan (
2011) applied the (quantile-weighted) continuous ranked probability score instead of the logarithmic score.
In contrast to the rich variety of tests available for the backtesting of a univariate series of density forecasts, or comparing competing sets of forecasts, the literature on multivariate backtesting is much more sparse. Multivariate extensions of univariate VaR backtests have not been widely developed, although
Berkowitz et al. (
2011) already sketched a number of ideas. We are aware of two papers by
Danciulescu (
2016) and
Wied et al. (
2016) that deal with the multivariate backtesting of VaR, in the sense of backtesting VaR forecasts for several portfolios (or desks) simultaneously.
Danciulescu (
2016) suggests a test for the unconditional coverage hypothesis and a test for the independence hypothesis using multivariate portmanteau test statistics of the Ljung–Box type, which are applied to the multivariate time series of VaR violation indicator variables. The test for the independence hypothesis simultaneously tests for the absence of cross- and autocorrelations in the multivariate time series of VaR violation indicator variables up to some finite lag
K. In
Wied et al. (
2016), multivariate tests are proposed for the detection of clustered VaR violations, which would violate the conditional coverage hypothesis. They argue that their test can detect the clustering of VaR exceedances for a single desk, which would indicate that the probability of VaR violations is varying over time, as well as the clustering of VaR exceedances across desks at different lags, which would cast doubt on the assumption of independent VaR violations for different desks at different time points.
Unfortunately, due to the sensitivity of the data concerned, there are very few empirical studies of bank-wide P&L backtesting and even fewer studies of desk-level data. One exception is
Berkowitz et al. (
2011), who analysed daily realisations from the P&L distribution and daily forecasts of VaR (calculated by the widely used historical simulation method) for each of four separate business lines at a large international commercial bank. In this study, various univariate tests, including the Markov test of
Christoffersen (
1998) for the conditional coverage hypothesis, the CaViaR test for autocorrelation of
Engle and Manganelli (
2004), and the unconditional coverage hypothesis test of
Kupiec (
1995), were used to backtest the VaR for each trading desk separately. Moreover, the tests were assessed based on their finite sample size and power properties in a Monte Carlo study. The authors found that the VaR models for two out of the four business lines were rejected due to volatility clustering and that the model of a third business line was rejected by the unconditional coverage hypothesis test of
Kupiec (
1995).
The remainder of our paper is organised as follows. In
Section 2, we explain the testing approach based on PIT values and recapitulate the main details of the spectral test of
Gordy and McNeil (
2020). We then show how this may be extended to obtain multivariate spectral tests based on a single spectrum or multiple spectra in
Section 3. In
Section 4, we carry out a simulation study to analyse the size and power of the proposed tests for different amounts of data, different numbers of desks, different choices of kernels, and different deviations from the null hypothesis. Concluding remarks are found in
Section 5.
4. Simulation Study
4.1. Design of the Study
We vary the following variables in the simulation study.
Sample size n. This corresponds to the number of days used in the bank’s backtesting exercise. The length of the backtesting period is typically small (one or two years of daily data, corresponding to days on which markets are open) and so we consider n = 250 and n = 500.
Number of desks d. This can be quite large in a bank with extensive trading operations; we consider two different values of d.
Copula C of PIT values across desks. We assume different dependence structures across desks by sampling PIT values with different copulas. In particular, we use the Gauss copula and the copula of a multivariate t distribution with 4 degrees of freedom. The latter case allows us to see how the properties of the spectral tests are affected by tail dependencies in the PIT data.
Level of dependence across desks. For simplicity, we assume that all desks are equi-dependent by setting the correlation matrix R of the Gauss and t copulas to be an equicorrelation matrix with common parameter ρ; the values used are given below.
Fraction of misspecified desks. In examining size and power, we vary the fraction of desks that use misspecified P&L distributions in their risk models.
Specification of spectral test. We use a number of different monospectral and bispectral tests as detailed below.
We generate
n independent realisations of the vector P_t = (P_{t1}, …, P_{td}) representing the PIT values at time t for desks j = 1, …, d. The vectors P_t are drawn from a distribution with joint distribution function F(u_1, …, u_d) = C(F_1(u_1), …, F_d(u_d)), where C denotes the copula (Gauss or Student) and F_1, …, F_d are continuous univariate distribution functions, which are designed to capture the effects of correctly and incorrectly specified desk-level P&L models. The correlation matrix
R, which is used to parameterise the Gauss copula and the Student t4 copula, is an equicorrelation matrix with parameter ρ, which is either zero or strictly positive. Note that, in the former case (ρ = 0), the Gauss copula yields a model where the simulated PIT values are independent across desks, while the t4 copula yields a model with dependencies. This is because, even with a correlation matrix
R equal to the identity, the Student t4 copula still has tail dependence, which will tend to lead to very large or very small PIT values occurring together across a number of desks; see, for example,
McNeil et al. (
2015)
2.
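Under the null hypothesis, the design above amounts to sampling uniform margins coupled by an equicorrelation Gauss or t4 copula. The Python sketch below illustrates this with a one-factor sampling device; the function names and interface are our own (the study itself was carried out in R).

```python
import math
import random
import statistics

def t4_cdf(x):
    """Closed-form distribution function of the Student t with 4 df."""
    s = x / math.sqrt(4.0 + x * x)
    return 0.5 + 0.75 * s - 0.25 * s ** 3

def sample_pit_vectors(n, d, rho, copula="gauss", seed=0):
    """Sample n iid d-vectors of PIT values with uniform margins and an
    equicorrelation dependence structure (rho >= 0) across desks."""
    rng = random.Random(seed)
    nd = statistics.NormalDist()
    out = []
    for _ in range(n):
        z0 = rng.gauss(0.0, 1.0)  # common factor across desks
        x = [math.sqrt(rho) * z0 + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(d)]   # equicorrelated standard normals
        if copula == "gauss":
            out.append([nd.cdf(xi) for xi in x])
        else:
            # Student t4 copula: a shared chi-squared(4) mixing variable
            # induces tail dependence even when rho = 0.
            w = -2.0 * math.log(rng.random() * rng.random())
            out.append([t4_cdf(xi / math.sqrt(w / 4.0)) for xi in x])
    return out
```

With rho = 0 and the Gauss copula the desks are independent; with the t4 copula the shared mixing variable makes very large or very small PIT values occur together across desks, as described above.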
For the marginal distributions F_j, we use a construction first developed in
Kratz et al. (
2018) and also used in
Gordy and McNeil (
2020). We set F_j(u) = G(Φ^{-1}(u)), where Φ is the standard normal distribution function and where G is either the standard normal distribution function or the distribution function of a univariate Student t4 distribution scaled to have variance one. If we wish to mimic a desk that is using a correctly specified desk model, we choose the normal and thus obtain F_j(u) = u, the distribution function of the standard uniform distribution. If we wish to mimic a desk that is using an incorrectly specified model, we choose the scaled t4 and obtain a distribution function F_j, which is supported on the unit interval [0, 1] but is not uniform; on the contrary, it is the type of PIT value distribution that would be obtained if the desk were using a model (represented by Φ) that was lighter-tailed than the true P&L distribution (represented by G) and thus underestimated the potential for large losses (and large gains). It is important to note that the use of these two distributions is simply a device to generate PIT data that either satisfy or violate the null hypothesis; we do not claim that these distributions are in any sense the true distributions. Recall that the null hypothesis is that the PIT vectors P_t are iid random vectors with uniform marginal distributions but an unknown dependence structure.
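The misspecified margin can be sampled directly from its definition: draw the "true" P&L from the scaled t4 distribution and apply the lighter-tailed normal model's distribution function. A Python sketch of this construction follows; the function name is ours.

```python
import math
import random
import statistics

def misspecified_pits(n, seed=0):
    """PIT values for a desk whose model is standard normal while the
    true P&L follows a Student t4 scaled to unit variance."""
    rng = random.Random(seed)
    nd = statistics.NormalDist()
    pits = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        w = -2.0 * math.log(rng.random() * rng.random())  # chi-squared(4)
        t = z / math.sqrt(w / 4.0)                        # Student t4, variance 2
        pits.append(nd.cdf(t / math.sqrt(2.0)))           # PIT under the normal model
    return pits
```

Because the true distribution is heavier-tailed than the model, these PIT values pile up near 0 and 1: the proportion above 0.99 exceeds the 1% that a uniform sample would give, which is exactly the underestimation of large losses (and large gains) described above.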
For the spectral tests, we selected kernels corresponding to the following three continuous weighting functions g defined on the kernel window:
- (1) The uniform weighting function;
- (2) The linear weighting function;
- (3) The exponential weighting function.
The values α1 and α2 determine the kernel window [α1, α2]; we choose values that give a symmetric interval around the 0.99 level. Note that the functions are non-decreasing, placing more weight on more extreme outcomes (in the right tail).
For the multivariate monospectral Z-tests, the kernel functions listed above lead to three different tests, which we denote, respectively, by SP.U, SP.L, and SP.E. For the multivariate bispectral Z-tests, we will look at two different Z-tests, which combine the continuous kernel functions listed above. The test denoted SP.UL combines the uniform and linear weighting functions. The test denoted SP.UE combines the uniform and exponential weighting functions.
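For intuition, a univariate monospectral Z-test with the uniform kernel admits a closed form: the spectral transform of a PIT value P is W = (min(P, α2) − α1)+, and its moments under uniformity are elementary. The sketch below is our own simplified univariate version, with illustrative window endpoints; the paper's multivariate tests additionally pool across desks and correct for dependence.

```python
import math

def monospectral_z(pits, a1=0.985, a2=0.995):
    """One-sided monospectral Z-statistic with the uniform kernel on
    [a1, a2] (window endpoints here are illustrative assumptions).

    W = (min(P, a2) - a1)_+ is the spectral transform of a PIT value P;
    its mean and variance under uniform PITs are available in closed form.
    """
    h = a2 - a1
    w = [max(0.0, min(p, a2) - a1) for p in pits]
    mean = h * h / 2.0 + (1.0 - a2) * h        # E[W] under P ~ U(0, 1)
    m2 = h ** 3 / 3.0 + (1.0 - a2) * h * h     # E[W^2] under P ~ U(0, 1)
    var = m2 - mean * mean
    n = len(pits)
    wbar = sum(w) / n
    return math.sqrt(n) * (wbar - mean) / math.sqrt(var)
```

A large positive Z indicates more PIT mass in and beyond the kernel window than a uniform sample would produce, i.e., a systematic underestimation of tail risk.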
The simulation experiments described in the following sections were all performed using the R package
simsalapar, which is a very flexible tool for conducting large-scale studies with a number of different dimensions (see
Hofert and Mächler 2016). The tables that we provide show observed rejection rates for the null hypothesis in 1000 replications of the simulation experiment.
We use a colouring convention to help with the interpretation of these rejection rates and this is applied differently according to whether the results address the size or power of a test. Simulation results relating to size are colour-coded as follows: good results (observed size smaller than or equal to 6%) are coloured green; poor results (observed size in the range [9–12%]) are coloured pink; very poor results (size above 12%) are coloured red; all other values are uncoloured. Simulation results relating to power are colour-coded as follows: good results (observed power above 70%) are coloured green; poor results (observed power in the range [10–30%]) are coloured pink; very poor results (power below 10%) are coloured red; all other values are uncoloured.
4.2. Evaluating the CE Method of Correcting for Inter-Desk Dependence
We first investigate the crucial CE method of correcting for the unknown dependence structure across desks. We consider two extreme situations—one in which all desks are correctly specified and one in which all desks are incorrectly specified. The former situation allows us to evaluate the size of the test, i.e., the probability of a significant test result when the null hypothesis holds. The latter situation is one that should certainly be picked up by any backtest with reasonable power.
Results for the monospectral tests are found in
Table 1. These relate to one-sided tests of the null hypothesis, where we are interested in being able to detect the systematic underestimation of tail risk in the right tail of the loss distribution. The nominal level
of the tests is 0.05. The table shows the actual test rejection rates over 1000 replications for different backtest lengths
n, desk numbers
d, copulas
C, distribution functions G, and correlation values ρ. The field CE shows whether or not the CE correction method has been used.
The rows in which G is recorded as “N” address the question of size, and we expect values close to the nominal level of the test, 0.05. However, when the CE correction method is not implemented, it is clear that the spectral tests are oversized in all cases except where the desks are independent, which is the column corresponding to a Gauss copula and ρ = 0; in the case of a t4 copula with ρ = 0, the desks are still dependent and the tests are oversized. In the absence of correction for correlation, there are a number of results coloured red, indicating a complete inability to control the size. In contrast, when the CE correction is implemented, the results are coloured green in all cases.
The rows in which G is recorded as “t4” address the question of power, since all desks are misspecified. For each of the spectral tests, the power increases with both n and d, as we would expect. However, the power decreases with strengthening dependence across desks; the power for the case C = t4 with ρ = 0 is greater than for C = Gauss with positive ρ, which in turn is greater than for C = t4 with positive ρ. Increasing levels of dependence can be thought of as effectively reducing the number of independent desk results. Turning to the different spectral tests, the performance shows similar patterns, but the SP.L kernel (linear weighting function) seems to give the highest power in this case.
In view of this first set of results, we will apply the CE correction method to the spectral tests in all further experiments.
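The CE correction can be sketched as follows: instead of standardising the pooled per-desk spectral transforms under an assumption of cross-desk independence, one plugs an estimated cross-desk covariance into the variance of the pooled statistic. The code below is our own plausible reading of such a correction, not the paper's exact estimator; the estimator in Equation (10) may differ in detail.

```python
import math

def pooled_z_ce(w, mean, var):
    """Pooled multi-desk Z-statistic with a correlation correction in the
    spirit of the CE method (a sketch; names and details are ours).

    w: list of n rows, each a list of d spectral transforms W_{tj};
    mean, var: moments of a single W_{tj} under the null (uniform PITs).
    """
    n, d = len(w), len(w[0])
    col_means = [sum(row[j] for row in w) / n for j in range(d)]
    # Variance of the sum of the d desk statistics: null variance on the
    # diagonal, estimated cross-desk covariances off the diagonal.
    total_var = 0.0
    for j in range(d):
        for k in range(d):
            if j == k:
                total_var += var
            else:
                cjk = sum((row[j] - col_means[j]) * (row[k] - col_means[k])
                          for row in w) / n
                total_var += cjk
    s = sum(col_means) - d * mean  # centred pooled statistic
    return math.sqrt(n) * s / math.sqrt(total_var)
```

With a single desk (d = 1), the statistic reduces to the ordinary monospectral Z; with many dependent desks, the positive off-diagonal terms inflate the denominator and so undo the oversizing seen in the uncorrected tests.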
4.3. Size and Power of Bispectral Tests
We now consider the two bispectral tests under the two scenarios of
Table 1—all desks correctly specified and all desks misspecified. For the bispectral tests, we also add results for backtests of length n = 1000, corresponding to 4 years of data. Results are shown in
Table 2. Note that the bispectral tests are two-sided tests.
While the power of these tests is perfect, the size properties are not as good as for the monospectral tests. This is particularly apparent for a backtest of length n = 250; the situation improves for n = 500 and the size results are very good for n = 1000; if anything, there is evidence of undersizing. We interpret these results as showing that more data are required in order for the estimators of (
10) to give an accurate correction for the unknown desk dependence structure in the bispectral case.
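To see why the bispectral case is more demanding, note that it compares a two-dimensional vector of kernel transforms against a chi-squared(2) reference, so a 2×2 covariance matrix must be estimated. The following simplified univariate sketch uses sample moments; it is our own illustration, and the paper's version additionally pools across desks with the CE correction.

```python
def bispectral_t(w1, w2, mu1, mu2):
    """Quadratic-form statistic n * (wbar - mu)' S^{-1} (wbar - mu) for two
    kernel transforms w1, w2 with null means mu1, mu2, compared against a
    chi-squared distribution with 2 degrees of freedom (two-sided test)."""
    n = len(w1)
    b1 = sum(w1) / n
    b2 = sum(w2) / n
    # Entries of the 2x2 sample covariance matrix S:
    s11 = sum((x - b1) ** 2 for x in w1) / n
    s22 = sum((y - b2) ** 2 for y in w2) / n
    s12 = sum((x - b1) * (y - b2) for x, y in zip(w1, w2)) / n
    det = s11 * s22 - s12 * s12
    d1, d2 = b1 - mu1, b2 - mu2
    # Quadratic form via the explicit 2x2 inverse:
    return n * (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
```

The extra covariance entries that must be estimated are one intuitive reason why the bispectral tests need larger samples before their size is well controlled.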
4.4. Evaluating the Effect of Desk Misspecification Rate
We now consider the more realistic situation where only a certain fraction of the desks are incorrectly specified. We choose values of 25% and 50% for this fraction. All the misspecified desks use the model in which G
is t4. Results for monospectral tests are shown in
Table 3.
When 25% of the desk models are misspecified, the power is rather low, except in the case where the desks are independent (i.e., have a Gauss copula and correlation ρ = 0). We attribute this to the intuition that independent data increase the effective sample size, whereas correlated data decrease it. The power increases with both n and d. As the misspecification rate increases, the power increases, as we would expect. The spectral test with a linear weighting function (SP.L) gives the most power, while the other two kernels are comparable.
The results for the bispectral tests SP.UL and SP.UE are reported in
Table 4. In this case, we add results for n = 1000, since we have observed that bispectral tests typically require more data for the correlation correction (CE) method to give tests that are well sized. We also add results for a misspecification fraction of 10%.
The results are clearly better than for the monospectral tests and show the advantages of bispectral tests—by using two kernels, we can effectively test for the correct specification of more moments of the distribution of PIT values in the kernel window (see discussion in
Section 2.3). The general observations are the same as for the monospectral tests; the power increases with both
n and
d and with weakening dependence across desks. While it would clearly be best to base backtests on 1000 observations, to obtain tests that are both well sized and powerful, reasonable results are obtained for n = 500, even when only 25% of the desks are using misspecified risk models. When only 10% of the desks use misspecified models, the power is clearly weaker, but the test still has some ability to detect that a number of desks are delivering poor P&L estimates.
There is little to choose between the SP.UL and SP.UE tests. While the latter seems to be slightly more powerful, it also tends to have slightly worse size properties, as seen in Table 2; we would tend to favour the former for samples of size n = 500.
5. Conclusions
In this paper, we proposed multivariate spectral Z-tests, which can be used to simultaneously backtest trading desk models when the dependence structure of P&L across trading desks is unknown. Multivariate backtests are potentially more powerful than univariate backtests at the level of the trading book, since they exploit a greater amount of data; typical banks can have 50–100 trading desks. Moreover, a simultaneous backtest avoids the problem of aggregating and drawing inferences from a set of single-desk backtests in the presence of unknown dependencies between test results.
The tests that we have developed are a multivariate extension of the spectral tests proposed in
Gordy and McNeil (
2020). They take the form of a Z-test against a normal or chi-squared reference distribution and make use of realised PIT values as input variables. PIT values provide more data about the quality of desk models than indicator variables for VaR violations and their benefits are already being exploited in the USA. It is likely that other regulatory authorities will use this type of information more systematically in the future as well.
The multivariate spectral tests are designed in such a way that the risk modelling group can select a kernel or weighting scheme to emphasise the region of the estimated P&L distribution where model performance is most critical, typically a region in the tail representing large losses. The tests can also provide an indirect validation of the expected shortfall measure, which now has a prominent role in the revised regulation.
We suggested a method of controlling for the unknown dependencies between desks, based on estimating correlations. The resulting tests were generally well sized, although bispectral tests typically required samples of n = 500 PIT vectors (2 years of daily data), while monospectral tests only required n = 250 (1 year of daily data). However, in realistic situations where only a minority of the desk models used underestimated the risks in their P&L distributions, bispectral tests gave much better power and should generally be preferred to their monospectral counterparts.
The performance of the proposed multivariate tests suggests that they are a valuable addition to a bank’s validation framework. Since they take the form of Z-tests, they are easy to implement and have quick run times. In the event of a significant test result, a bank would be able to implement a post-hoc testing scheme on individual desks to see where the main problems lie.
Finally, the new multivariate backtests could also have an interesting application to backtesting in the banking sector as a whole, based on data for the trading book provided to the regulator by individual banks. Significant test results would be an indication of the potential for spillover effects across banks caused by trading activities and would contribute to the understanding of systemic risk.
Clearly, it would be of interest to apply our tests to actual reported data from banks. However, multi-desk data are not currently available to researchers due to the sensitivities surrounding bank risk reporting both in the EU and the USA. An anonymised study of real data at the trading-book level for US banks is provided by
Gordy and McNeil (
2020) and shows the advantages of the (univariate) spectral backtesting framework over simple VaR backtests; we would certainly expect that these advantages will carry over to the multivariate setting. We hope that, by setting out a methodology for multi-desk backtesting in this paper, we can help to promote the use of desk-level PIT values for model validation and encourage banks and regulators to allow some datasets of realised PIT values to enter the public domain and stimulate further research into the refinement of the methodology.