The Impact of Model Uncertainty on Index-Based Longevity Hedging and Measurement of Longevity Basis Risk

We investigate the impact of model uncertainty on hedging longevity risk with index-based derivatives and assessing longevity basis risk, which arises from the mismatch between the hedging instruments and the portfolio being hedged. We apply the bivariate Lee–Carter model, the common factor model, and the M7-M5 model, with separate cohort effects between the two populations, and various time series processes and simulation methods, to build index-based longevity hedges and measure the hedge effectiveness. Based on our modeling and simulations on hypothetical scenarios, the estimated levels of hedge effectiveness are around 50% to 80% for a large pension plan, and the model selection, particularly in dealing with the computed time series, plays a very important role in the estimation. We also experiment with a modified bootstrapping approach to incorporate the uncertainty of model selection into the modeling of longevity basis risk. The hedging results under this approach may approximately be seen as a “weighted” average of those calculated from the different model candidates.


Introduction
Continual increase in longevity worldwide remains a serious concern for pension plan sponsors and annuity providers. The major issue is the presence of longevity risk, which is the risk of paying more than expected because of unanticipated mortality improvements. Broadly speaking, there are three ways to mitigate longevity risk (e.g., Cairns et al. 2008). The first is to transfer the unwanted risk to an insurer or reinsurer by paying a premium. The problem of this usual approach is that insurers and reinsurers may also have limited appetite, and that a failure of a major player could cause disastrous systemic outcomes. The second way is natural hedging, which makes use of the opposite movements between the values of annuities and life insurances arising from changes in mortality levels (e.g., Li and Haberman 2015). Although certain large institutions may have the resources and economies of scale to offer both lines of products and so exploit this hedging effect, many other financial entities do not have such necessary conditions to follow suit.
The third way is the use of capital market solutions, such as insurance securitization (e.g., Cowley and Cummins 2005), and mortality-and longevity-linked securities (e.g., LCP (Lane Clark & Peacock LLP) 2012; Coughlan et al. 2007). The former involves packaging insurance and business risks into securities, such as bonds with coupon and principal payments depending on the performance of This paper is organized as follows. In Section 2, we set forth a list of two-population mortality projection models, time series processes, and simulation approaches selected for our analysis. In Section 3, we examine the hedge effectiveness under different model settings and assumptions and simulated scenarios, and provide a discussion on the impact of model uncertainty. Finally, we give our concluding remarks and set forth some suggestions for future research in Section 4.

Mortality Modeling and Simulations
Recently, there is an emerging branch of literature that focuses on modeling multiple populations jointly. Haberman et al. (2014) and Li et al. (2014) provide a comprehensive review of several multi-population mortality projection models for measuring longevity basis risk. Although there are probably more than 30 models proposed in the literature, we select the ones below for our analysis, as they can be deemed as "key representatives" of the major "families" of mortality projection models.
The first belongs to the Lee and Carter (1992) family, in which there are some possible options (noted as Model 1a, Model 1b, Model 1c, and Model 1d respectively) for modeling the time-varying parameters: (VARIMA( p , 1, 0)) (1d) Figure 1. A modeling framework for longevity basis risk.

Mortality Modeling and Simulations
Recently, there is an emerging branch of literature that focuses on modeling multiple populations jointly. Haberman et al. (2014) and Li et al. (2014) provide a comprehensive review of several multi-population mortality projection models for measuring longevity basis risk. Although there are probably more than 30 models proposed in the literature, we select the ones below for our analysis, as they can be deemed as "key representatives" of the major "families" of mortality projection models.
The first belongs to the Lee and Carter (1992) family, in which there are some possible options (noted as Model 1a, Model 1b, Model 1c, and Model 1d respectively) for modeling the time-varying parameters: ln m x,t,i = α x,i + β x,i κ t,i + ι t−x,i ; (Lee-Carter model with cohort) (1) with K t = Θ + K t−1 + ∆ t , (bivariate random walk with drift) (1a) or κ t,1 = θ + κ t−1,1 + δ t , (random walk with drift) κ t,2 = a 0 + a 1 κ t,1 + ω t , (co − integrated process) (1b) The term m x,t,i is the central death rate at age x in year t of population i, in which i = 1 refers to the reference population and i = 2 refers to the book population. The parameter α x,i is the general age schedule, κ t,i is the mortality index over time, K t = (κ t,1 , κ t,2 ) , and β x,i is the age sensitivity of the log death rate to the mortality index. The cohort parameter ι t−x,i is added when there are significant patterns in the residuals against cohort year (e.g., Haberman and Renshaw 2011). It can be modeled with an autoregressive integrated moving average (ARIMA) process for future projections and simulations, though this is not required in the analysis in the next section. Following Cairns et al. (2009), the first and last five cohorts are excluded from the fitting procedure due to a lack of data.
The first method suggested in Carter and Lee (1992) is fitting the Lee-Carter model to each population separately and then modeling the relationship between the two resulting mortality indices. This approach involves the use of a bivariate random walk with drift (1a) naturally, in which Θ is the vector drift term and ∆ t is the Gaussian vector error term. In contrast, the third method in Carter and Lee (1992) is treating the two mortality indices as a co-integrated process (1b), where θ, a 0 , and a 1 are the parameters of the process, and δ t and ω t are independent Gaussian error terms. Moreover, Yang and Wang (2013) proposed using a vector error correction model (VECM) of order p (1c), in which Θ is the vector constant term, Π and Γ r are the matrix components, and ∆ t is the Gaussian vector error term. Finally, we also consider a vector autoregressive integrated moving average (VARIMA) process (1d), which has been explored by Chan et al. (2014) in modeling other mortality indices. The order chosen is (p, 1, 0), Θ is the vector constant term, Γ r is the autoregressive matrix, and ∆ t is the Gaussian vector error term. The VARIMA in (1d) is in line with the VECM in (1c) and serves as an extension of the random walk in (1a) and (1b) in terms of allowing for more flexibility in modeling the various mortality indices, for comparison purposes. Please note that all the options (1a) to (1d) would generally lead to non-coherence in mortality projections, i.e., the ratio of projected death rates (central estimates) between the two populations at each age group does not converge to a constant over time (e.g., Cairns et al. 2011).
The second is the Li and Lee (2005) family, which is an extension of the Lee-Carter model and assumes a common factor for the two populations. Again, there are a few possible choices (noted as Model 2a, Model 2b, Model 2c, and Model 2d respectively) in modeling the temporal parameters: γ r,i,j κ t−r,i,j + δ t,i,j , AR(p); correlated δ t,1, j and δ t,2,j (2c) The parameter α x,i is the general age schedule of population i, κ c t is the mortality index of the common factor for both populations with age sensitivity β c x , and κ t,i,j is the time component of the jth additional factor for population i with age sensitivity β x,i,j . The common mortality index is modeled as a random walk with drift as usual, with d as the drift and ε t as the Gaussian error. Originally, Li and Lee (2005) used only one additional factor; later, Li (2013) proposed using n additional factors where necessary, and Yang et al. (2016) further suggested adding different cohort parameters such as ι t−x,i . Li and Lee (2005) assumed that κ t,i,1 followed an autoregressive (AR) process of order one, and that the error terms are independent between the populations (2a). On the other hand, Li (2013) tested a more general AR(p) process for each κ t,i, j (2b), and Zhou and Li (2016) assumed that the error terms δ t,i,1 's are correlated between the populations (2c). The parameters θ i,j and γ r,i,j are the constant and autoregressive terms respectively. Lastly, we also include a vector autoregressive (VAR) process of order p for K t,j = κ t,1, j , κ t,2, j (2d), which has been adopted by Haberman et al. (2014) for modeling some other mortality indices. The notation Θ j is the vector constant term, Γ r,j is the autoregressive matrix, and ∆ t,j is the Gaussian vector error term. All the choices (2a) to (2d) can result in coherence in mortality projections, i.e., the ratio of projected death rates between the two populations at each age group tends to a constant over time, if the selected time series processes are weakly stationary. More technical details of time series modeling can be found in Tsay (2002).
The final one is an extension of the Cairns-Blake-Dowd (CBD) model by Cairns et al. (2006) and is proposed by Haberman et al. (2014), being referred to as the M7-M5 model. There are also several possible time series processes to choose from (noted as Model 3a, Model 3b, Model 3c, and Model 3d respectively) under this model: (multivariate random walk with drift) (multivariate random walk with drift) The term q R x,t is the mortality rate at age x in year t of the reference population, κ R t,1 , κ R t,2 , and κ R t,3 represent the level, slope, and curvature of the mortality curve in year t (e.g., Cairns et al. 2009), and ι R t−x is the cohort parameter. The difference in the logit mortality rate between the book and reference populations (i.e., logit q B x,t − logit q R x,t ) is then modeled with another CBD structure of κ B t,1 and κ B t,2 , which explain the differences between the two populations, together with another cohort parameter ι B t−x . Please note that the original M7-M5 model has a common cohort parameter for both populations, while we use a separate cohort parameter for each population in order to allow for the different cohort effects in our data. The notation x is the average age of the age range in the data, and σ 2 is the average value of (x − x) 2 .
In contrast to families (1) and (2) above, where both populations are modeled in parallel, Haberman et al. (2014) fitted the reference population first and then the differences between the two populations. One major argument for this treatment is that there are usually much more data for the reference population than for the book, and so it would be more appropriate to base the main trends on the reference population and consider the differences of the book population afterwards. Haberman et al. (2014) assumed that K R t = κ R t,1 , κ R t,2 , κ R t,3 follows a multivariate random walk with drift, K B t = κ B t,1 , κ B t,2 follows a VAR(1) process, and ∆ R t and ∆ B t are independent, whereas we adopt a more general VAR(p) process here for K B t (3a). Accordingly, we also consider three other alternatives. First, similar to replacing the option (1a) with (1d), we replace the multivariate random walk with drift with a VARIMA(p, 1, 0) process for K R t (3b). Second, we replace the VAR(p) process with a bivariate random walk without drift for K B t (3c), as we have seen in our analysis that the calculated κ B t,1 and κ B t,2 fluctuate around a constant level in many cases. Moreover, in line with the options (2b) and (2c), we test the correlation assumption between ∆ R t and ∆ B t (3d). All the vectors and matrices of parameters and error terms have similar meanings as previously. Please note that the options (3a), (3b) and (3d) can generate coherence in the projected mortality rates approximately if the selected VAR process is weakly stationary.
Besides the central estimates' coherence, the simulated future variability is also a significant feature, as the relative potential movements between the book mortality and reference mortality is one main driver of longevity basis risk. This issue, however, has largely been overlooked in the current literature, which usually focuses on the mortality projection models and the (weak) stationarity of the time series processes, but not on the simulated future variability resulting from time series modeling. Based on the M7-M5 model and the CAE+Cohorts model (which is an extension of the Lee-Carter model), Li et al. (2017) found that the behavior of simulated future variability of the "book minus reference" component is the most important time series modeling assumption. It should be noted that while the random walk and integrated autoregressive processes above produce increasing variability over time in the simulations, the autoregressive processes (not integrated ones) generate bounded variability instead. The combined or alternative use of these time series processes would give rise to varying hedging results, as shown in the next section.
In addition to the Lee-Carter, Li-Lee, and CBD families described above, there are also several other similar approaches for modeling longevity basis risk (e.g., Plat 2009;Coughlan et al. 2011;Ngai and Sherris 2011;Tsai et al. 2011;Cairns 2013;Li et al. 2016). Furthermore, while there is an abundant amount of time series processes developed in econometric studies, it appears that only a handful of them would actually turn out to be useful for modeling longevity basis risk in practice, due to the usually short length and small amount of annual book data being available.
Regarding the simulation methods, we consider four different choices (e.g., Li 2014). The first is simply Monte Carlo simulation as in Lee and Carter (1992), where only process error is allowed for. The second is the semi-parametric bootstrapping approach proposed in Brouhns et al. (2005), and the third is the residuals bootstrapping approach suggested by Koissi et al. (2006), both of which include process error and also parameter error. These simulation methods have initially been catered for single-population mortality projection models, and some adaptations are needed for the two-population models in this paper. In particular, in addition to the choice of resampling the reference and book residuals separately, we also group together the two populations' residuals in each age-time cell as an individual (bivariate) data point for resampling, such that more information about the relationships between the two populations may be embedded into the simulated samples (Li and Haberman 2015).
Apart from measuring the hedge effectiveness by testing various models and investigating different simulated scenarios, we also adopt the modified semi-parametric bootstrapping approach in Yang et al. (2015) and incorporate process error, parameter error, and model error simultaneously. It is effectively an extension of the bootstrapping approach suggested by Brouhns et al. (2005). First, a pseudo-sample of the number of deaths is simulated from the Poisson distribution with the observed number of deaths as the mean. Suppose there are two or more competing model candidates under consideration. Each model in turn is fitted to the pseudo-data sample and the corresponding model parameters are estimated. Based on some pre-determined model selection criteria, the most "optimal" model is selected for this particular pseudo-data sample. The time series of the selected model's temporal parameters are then simulated into the future. Lastly, future death rates are produced from the estimated and simulated parameters of the selected model. So far in this process, the pseudo-data sample is used for generating only one future scenario via one selected model. The entire process above can then be repeated iteratively to create, say, 5000 future scenarios in total 1 , in which different models may be chosen in different scenarios. In this way, all the three errors are embedded in the simulated future death rates. This modified bootstrapping approach provides a different perspective to understand the impact of model uncertainty, though we acknowledge that the simulation time can be significantly lengthened, especially when more competing models and selection criteria are included 1 The size of 5000 iterations appears to produce fairly stable results in the next section. We find that the estimated levels of hedge effectiveness usually differ by less than 3% in magnitude between repeated runs on the same case study.
Risks 2020, 8, 80 7 of 25 in the process. Again, the bootstrapping process needs to be properly adapted to the two-population models. Figure 2 below gives a list of the modeling techniques we experiment with to tackle the different error components. In the next section, we will discuss in detail the similarities and differences between the results on some hypothetical hedging scenarios generated by all these modeling and simulation methods.

Hedge Effectiveness
We have collected two sets of data for the book and reference populations. The first set is composed of the male assured lives and pensioners data (book) from the Continuous Mortality Investigation (CMI), and England and Wales male data (reference) from the Human Mortality Database (HMD 2017), for the period from 1983 to 2006. The second set comprises Australia, New Zealand, Japan, Taiwan, Hong Kong (Census and Statistics Department 2017), and Singapore (Department of Statistics 2017) male data from the HMD and governmental statistics departments, for the years 1980 to 2016. Since Asia-Pacific insured data are scarce, we use the data of New Zealand, Taiwan, and Singapore, with relatively smaller sizes, as a proxy for the book population, and the data of their larger neighbors for the reference population. The age range considered is from 60 to 89. Figure 3 plots the log central death rates of three age groups over time. It can be seen that the mortality declining trends of different populations or regions are roughly in line with one another in the past few decades. In general, the death rates of the assured lives and pensioners are lower than those of English and Wales. One potential problem of the assured lives and pensioners data is that there may have been different contributors to the data over the period and there would then be some extent of heterogeneity or inconsistency. Judging simply from the graphs below, it seems that the issue may not be overly significant. Australian death rates are slightly lower than New Zealand death rates, and Japan has experienced lower mortality levels than Taiwan. Hong Kong has had lower mortality experience than Singapore, while the latter is catching up fairly quickly in the last decade.

Hedge Effectiveness
We have collected two sets of data for the book and reference populations. The first set is composed of the male assured lives and pensioners data (book) from the Continuous Mortality Investigation (CMI), and England and Wales male data (reference) from the Human Mortality Database (HMD (Human Mortality Database) 2017), for the period from 1983 to 2006. The second set comprises Australia, New Zealand, Japan, Taiwan, Hong Kong (Census and Statistics Department, Hong Kong Special Administrative Region 2017), and Singapore (Department of Statistics, Ministry of Trade & Industry, Republic of Singapore 2017) male data from the HMD and governmental statistics departments, for the years 1980 to 2016. Since Asia-Pacific insured data are scarce, we use the data of New Zealand, Taiwan, and Singapore, with relatively smaller sizes, as a proxy for the book population, and the data of their larger neighbors for the reference population. The age range considered is from 60 to 89. Figure 3 plots the log central death rates of three age groups over time. It can be seen that the mortality declining trends of different populations or regions are roughly in line with one another in the past few decades. In general, the death rates of the assured lives and pensioners are lower than those of English and Wales. One potential problem of the assured lives and pensioners data is that there may have been different contributors to the data over the period and there would then be some extent of heterogeneity or inconsistency. Judging simply from the graphs below, it seems that the issue may not be overly significant. Australian death rates are slightly lower than New Zealand death rates, and Japan has experienced lower mortality levels than Taiwan. Hong Kong has had lower mortality experience than Singapore, while the latter is catching up fairly quickly in the last decade.
We mainly consider a hypothetical situation of a large pension plan with 10,000 members for a particular cohort. All the pensioners are aged 65 and every pensioner receives $1 on survival of each year in the next 25 years. Suppose that the pension plan financier attempts to minimize its longevity risk exposure by building a longevity hedge with index-based S-forwards (e.g., LLMA (Life and Longevity Markets Association) 2010), and that the S-forward at every future age is available for the same birth cohort as the pensioners. For a floating rate receiver, the payoff on maturity of a S-forward is equal to the actual survivor index (observed on maturity) minus the forward survivor index (set at time 0), in which the survivor index is the percentage of the reference population who are still alive on maturity. Assume that the current forward values are equal to the central estimates, setting a zero risk premium for convenience, and that the interest rate is constant at 1% p.a. throughout the period, that the mortality declining trends of different populations or regions are roughly in line with one another in the past few decades. In general, the death rates of the assured lives and pensioners are lower than those of English and Wales. One potential problem of the assured lives and pensioners data is that there may have been different contributors to the data over the period and there would then be some extent of heterogeneity or inconsistency. Judging simply from the graphs below, it seems that the issue may not be overly significant. Australian death rates are slightly lower than New Zealand death rates, and Japan has experienced lower mortality levels than Taiwan. Hong Kong has had lower mortality experience than Singapore, while the latter is catching up fairly quickly in the last decade. We mainly consider a hypothetical situation of a large pension plan with 10,000 members for a particular cohort. All the pensioners are aged 65 and every pensioner receives $1 on survival of each year in the next 25 years. Suppose that the pension plan financier attempts to minimize its longevity risk exposure by building a longevity hedge with index-based S-forwards (e.g., LLMA 2010), and that the S-forward at every future age is available for the same birth cohort as the pensioners. For a floating rate receiver, the payoff on maturity of a S-forward is equal to the actual survivor index (observed on maturity) minus the forward survivor index (set at time 0), in which the survivor index is the percentage of the reference population who are still alive on maturity. Assume that the current forward values are equal to the central estimates, setting a zero risk premium for convenience, and that the interest rate is constant at 1% p.a. throughout the period, considering the current low interest rate environment. The valuation date is taken as just after the end of the data period.
After fitting the above-mentioned models to the data and carrying out the simulations as in the previous section, we use the simulated future death rates of the book population and the binomial distribution to further simulate the number of surviving pensioners in each future year (e.g., Haberman et al. 2014). We can then obtain random samples of the present value of the pension plan liability. On the other hand, we use the simulated future death rates of the reference population to determine the random S-forwards payoffs, which are discounted to the valuation date. The notional amounts of the S-forwards are calculated by numerical optimization in order to maximize the level of hedge effectiveness (e.g., Li et al. 2017). The weights are mostly in the range of 0.6 to 1.1 (per person) in different cases of our analysis. Figure 4 demonstrates the longevity hedging scheme based on the use of index-based S-forwards. Please note that the counterparties can be a financial exchange or intermediary that brings the market investors (hedge providers) and the pension plan (hedger) into conducting standardized transactions. After fitting the above-mentioned models to the data and carrying out the simulations as in the previous section, we use the simulated future death rates of the book population and the binomial distribution to further simulate the number of surviving pensioners in each future year (e.g., Haberman et al. 2014). We can then obtain random samples of the present value of the pension plan liability. On the other hand, we use the simulated future death rates of the reference population to determine the random S-forwards payoffs, which are discounted to the valuation date. The notional amounts of the S-forwards are calculated by numerical optimization in order to maximize the level of hedge effectiveness (e.g., Li et al. 2017). The weights are mostly in the range of 0.6 to 1.1 (per person) in different cases of our analysis. Figure 4 demonstrates the longevity hedging scheme based on the use of index-based S-forwards. Please note that the counterparties can be a financial exchange or intermediary that brings the market investors (hedge providers) and the pension plan (hedger) into conducting standardized transactions. We mainly consider a hypothetical situation of a large pension plan with 10,000 members for a particular cohort. All the pensioners are aged 65 and every pensioner receives $1 on survival of each year in the next 25 years. Suppose that the pension plan financier attempts to minimize its longevity risk exposure by building a longevity hedge with index-based S-forwards (e.g., LLMA 2010), and that the S-forward at every future age is available for the same birth cohort as the pensioners. For a floating rate receiver, the payoff on maturity of a S-forward is equal to the actual survivor index (observed on maturity) minus the forward survivor index (set at time 0), in which the survivor index is the percentage of the reference population who are still alive on maturity. Assume that the current forward values are equal to the central estimates, setting a zero risk premium for convenience, and that the interest rate is constant at 1% p.a. throughout the period, considering the current low interest rate environment. The valuation date is taken as just after the end of the data period.
After fitting the above-mentioned models to the data and carrying out the simulations as in the previous section, we use the simulated future death rates of the book population and the binomial distribution to further simulate the number of surviving pensioners in each future year (e.g., Haberman et al. 2014). We can then obtain random samples of the present value of the pension plan liability. On the other hand, we use the simulated future death rates of the reference population to determine the random S-forwards payoffs, which are discounted to the valuation date. The notional amounts of the S-forwards are calculated by numerical optimization in order to maximize the level of hedge effectiveness (e.g., Li et al. 2017). The weights are mostly in the range of 0.6 to 1.1 (per person) in different cases of our analysis. Figure 4 demonstrates the longevity hedging scheme based on the use of index-based S-forwards. Please note that the counterparties can be a financial exchange or intermediary that brings the market investors (hedge providers) and the pension plan (hedger) into conducting standardized transactions.  Table 1 lists the BIC (Bayesian Information Criterion) values of fitting the three two-population mortality projection models to the various datasets via an iterative updating scheme based on Newton's method. In the mortality forecasting literature, the BIC is much more popular than the Akaike Information Criterion (AIC), as it is generally perceived that the BIC imposes a higher  Table 1 lists the BIC (Bayesian Information Criterion) values of fitting the three two-population mortality projection models to the various datasets via an iterative updating scheme based on Newton's method. In the mortality forecasting literature, the BIC is much more popular than the Akaike Information Criterion (AIC), as it is generally perceived that the BIC imposes a higher penalty on the use of parameters and can point to more parsimonious models. For the first three hedging scenarios, the M7-M5 model (3) produces the lowest BIC values (the italic figures). For the Taiwan pension plan hedged by a Japan index, and also the Singapore pension plan hedged by a Hong Kong index, the bivariate Lee-Carter model (1) gives the lowest BIC, followed very closely by the common factor model (2). Although the BIC is probably the most frequently used selection criterion in the mortality projection modeling literature, it is fundamentally a measure on how good the past patterns are being captured under parsimonious use of parameters, which may or may not lead to accurate or reasonable future predictions. Moreover, as noted earlier, there is always some level of uncertainty in model selection, and the BIC alone (and even together with residuals examination) may point to a less appropriate model given the random fluctuations in the data and the lack of information on possible outliers. In fact, several other model aspects should also be considered (e.g., Cairns et al. 2009). In terms of allowing for longevity basis risk, the most relevant model aspects would be the coherence property, behavior of simulated future variability, and simulation method.  Table 2 summarizes the behavior of the central estimates and simulated variability of each time series process considered. Table 3 provides the details of the selected orders of the fitted time series processes for all the datasets. Figure 5 then shows the various time-varying parameters estimated (solid lines), their central estimate projections (dashed lines), and their simulated 95% prediction intervals (dotted lines) using the UK assured lives data under different mortality projection models and time series processes. As shown, the random walk, co-integrated, VECM, and integrated autoregressive processes all produce linear projected (central estimate) trends and increasing variability across time in the simulations, whereas the (weakly stationary) autoregressive processes show convergence and bounded variability. The widths of the prediction intervals vary between different time series processes. For example, within the Lee-Carter family, the VECM(2) generates narrower intervals for both κ t,1 and κ t,2 , and the VARIMA(1,1,0) produces wider intervals for κ t,2 but narrower intervals for κ t,1 . For the Li-Lee family, the AR(3) yields wider (bounded) intervals for κ t,1,1 compared to the AR(1) and VAR(1). For the CBD family, the VARIMA(2,1,0) leads to wider intervals for κ R t,1 , κ R t,2 , and κ R t,3 compared to the multivariate random walk with drift, and the bivariate random walk without drift leads to unbounded intervals for κ B t,1 and κ B t,2 , in contrast to the bounded intervals from the VAR(2). Some goodness-of-fit test results of the fitted time series processes are given in the Appendix A. Figure 6 plots the corresponding book-to-reference ratios of projected death rates at ages 75 and 85. In agreement with the descriptions in the previous section, models (1a) to (1d) produce divergent ratios of projected death rates between the two populations. Moreover, the projected ratios diverge in various directions at different ages even under the same model. By contrast, models (2a) to (2d) and (3a) to (3d) yield convergent ratios of projected death rates at each age, which are in the range of around 0.6 to 0.7 as illustrated in the plots. Figure 7 then displays some simulated ratios using 10 randomly picked simulated paths. There are clear differences between model (1a) and models (2a) and (3a). Under model (1a), the simulated ratios can move to values very different from the projected ones over time, while under models (2a) and (3a), the simulated ratios fluctuate around the projected levels. The potential variations under model (1a) are much greater than those under models (2a) and (3a), in which model (3a) demonstrates more obvious fluctuations but within a shorter distance from the projected values than model (2a). Since demographic basis risk arises from possible deviations between the book and reference mortality movements (due to demographic or socioeconomic differences), the behavior of these simulated ratios of death rates would have a significant implication on the calculated levels of hedge effectiveness, which will be discussed in the following numerical analysis. Table 2. Major characteristics of time series processes under different mortality projection models.
Note: The selected orders of the fitted time series processes are based on the partial autocorrelation functions and matrices, whether the autocorrelations and cross-correlations of the residuals are insignificant, the estimated parameters are significant, and the resulting fitted time series process (2b, 2c, 2d, 3a, 3b, 3d) is weakly stationary.
The weak stationarity can be tested via examining whether all the roots/eigenvalues of the characteristic equation fall inside the unit circle. It can be also tested simply by projecting with the fitted time series process over a long  Figure 6 plots the corresponding book-to-reference ratios of projected death rates at ages 75 and 85. In agreement with the descriptions in the previous section, models (1a) to (1d) produce divergent ratios of projected death rates between the two populations. Moreover, the projected ratios diverge in various directions at different ages even under the same model. By contrast, models (2a) to (2d) and (3a) to (3d) yield convergent ratios of projected death rates at each age, which are in the range of around 0.6 to 0.7 as illustrated in the plots. Figure 7 then displays some simulated ratios using 10 randomly picked simulated paths. There are clear differences between  Figure 8 offers a different view on the simulated book and reference death rates at age 80 at different points of time using the UK pensioners data. It can be seen that over time, the simulated death rates move to the southwest direction due to mortality improvements, and the simulated variability (both the vertical and horizontal ranges) increases because of higher uncertainty in more distant future. However, again, there are significant differences between the three models. For models (2a) and (3a), the dependence between the simulated book and reference death rates increases gradually with time, whereas for model (1a), the weak dependence appears to remain at a low level. It means that for the former the two death rates are very unlikely to deviate significantly further from each other in the long term, but for the latter the two death rates can move in varying directions. The different model structures clearly have a large impact on the resulting association between the two populations in the simulations. The discussion of the numerical study below will explore the underlying reasons for these observations. These simulated differences in the dependence would affect the calculation of basis risk and hedge effectiveness significantly, as shown later.  (1a), the simulated ratios can move to values very different from the projected ones over time, while under models (2a) and (3a), the simulated ratios fluctuate around the projected levels. The potential variations under model (1a) are much greater than those under models (2a) and (3a), in which model (3a) demonstrates more obvious fluctuations but within a shorter distance from the projected values than model (2a). Since demographic basis risk arises from possible deviations between the book and reference mortality movements (due to demographic or socioeconomic differences), the behavior of these simulated ratios of death rates would have a significant implication on the calculated levels of hedge effectiveness, which will be discussed in the following numerical analysis.   Table 4 sets out the estimates of the standard deviation and the 99.5% Value-at-Risk (VaR) (minus the mean) of the present value of the pension plan liability, as a percentage of the expected present value of the liability. On average, the standard deviation is about 1.7% and the 99.5% VaR is around 4.3% of the mean, in which the ratio between the two measures is close to that of a standard normal distribution. For the assured lives and pensioners datasets, the bootstrapping approaches generally give larger VaR estimates than those from Monte Carlo simulation (the cells within borders), while for the other datasets, the differences are less obvious. The potential heterogeneity issue of the former may be the reason they demonstrate a greater effect of parameter uncertainty. The simulated variability from the Li-Lee family (2a, 2b, 2c, 2d) (the shaded rows), but not the other two families, is fairly robust to the selection of time series process. Model (3c) produces by far the largest estimates (the bolded figures) among all the models. It appears that the bivariate random walk for K B t together with the multivariate random walk for K R t in model (3c) produces much more variability in the book simulations compared with the other models.  Figure 8 offers a different view on the simulated book and reference death rates at age 80 at different points of time using the UK pensioners data. It can be seen that over time, the simulated death rates move to the southwest direction due to mortality improvements, and the simulated variability (both the vertical and horizontal ranges) increases because of higher uncertainty in more distant future. However, again, there are significant differences between the three models. For models (2a) and (3a), the dependence between the simulated book and reference death rates increases gradually with time, whereas for model (1a), the weak dependence appears to remain at a low level. It means that for the former the two death rates are very unlikely to deviate significantly further from each other in the long term, but for the latter the two death rates can move in varying directions. The different model structures clearly have a large impact on the resulting association between the two populations in the simulations. The discussion of the numerical study below will explore the underlying reasons for these observations. These simulated differences in the dependence would affect the calculation of basis risk and hedge effectiveness significantly, as shown later.  In line with Coughlan et al. (2011) and Li et al. (2017), the level of hedge effectiveness is defined as [1 -risk hedged /risk unhedged ] × 100%. The two quantities risk unhedged and risk hedged are the pension plan's longevity risk exposure before and after implementing the hedge. This measure gives the proportion of the original longevity risk exposure that is being transferred away. The remaining risk can then be seen as a result of longevity basis risk. We consider the longevity risk exposure as the standard deviation and the 99.5% VaR minus the mean of the present value of the pension plan liability. Please note that the 99.5% VaR measure is highly relevant to the Solvency Capital Requirement (SCR) calculation under Solvency II. It is a very important consideration for insurance practitioners and regulators in Europe. Figure 9 presents the levels of hedge effectiveness (i.e., the proportion of the initial risk that is reduced) under various models and simulation methods using different datasets. The major observations and implications from the numerical results are listed below: 1.
The estimated levels of hedge effectiveness are largely between 50% to 80%. The clear exceptions include those produced from models (1a), (1d), and (3c), which are about 20% or lower. The first two of these three cases are non-coherent and generate increasing variability in the time-varying parameters' future simulations for both the book and reference populations. Although the estimated correlations in their Gaussian error terms are quite high (about 0.3 or greater), the relationship between the book and reference mortality levels under the two models looks too weak, when considering their past movements in the data (see Figure 3), and there is an overestimation of longevity basis risk. Comparatively, while models (1b) and (1c) are also non-coherent, their co-integration and error correction structures lead to some extent of co-movements between the two populations in the simulations, which is reflected in the estimated hedge effectiveness. The last exception, model (3c), is coherent, but it produces increasing simulated variability over time for the "book minus reference" component's time-varying parameters. The resulting effect is that the simulated book and reference mortality movements could deviate significantly from each other, progressively so over the long run, leading to an overestimation of longevity basis risk and so an underestimation of hedge effectiveness. By contrast, the other model choices do not suffer from this problem.  Table 4 sets out the estimates of the standard deviation and the 99.5% Value-at-Risk (VaR) (minus the mean) of the present value of the pension plan liability, as a percentage of the expected present value of the liability. On average, the standard deviation is about 1.7% and the 99.5% VaR is around 4.3% of the mean, in which the ratio between the two measures is close to that of a standard normal distribution. For the assured lives and pensioners datasets, the bootstrapping approaches generally give larger VaR estimates than those from Monte Carlo simulation (the cells within borders), while for the other datasets, the differences are less obvious. The potential heterogeneity issue of the former may be the reason they demonstrate a greater effect of parameter uncertainty. The simulated variability from the Li-Lee family (2a, 2b, 2c, 2d) (the shaded rows), but not the other two families, is fairly robust to the selection of time series process. Model (3c) produces by far the largest estimates (the bolded figures) among all the models. It appears that the bivariate random walk for B t  together with the multivariate random walk for R t  in model (3c) produces much more variability in the book simulations compared with the other models.
In line with Coughlan et al. (2011) and Li et al. (2017), the level of hedge effectiveness is

2.
The results from the 2b,2c,2d) are generally quite robust to the choice of time series modeling, as long as the fitted time series processes are weakly stationary. Under the Li-Lee mortality structure, both populations are governed by the common factor, in which the common mortality index has increasing variability over time in the simulations, while the additional population-specific factors' time-varying parameters have bounded variability in the simulations under all the four choices. Consequently, the former variability would dominate eventually, and any deviance between the simulated book and reference mortality movements is unlikely to continue to grow. The figures from models (2b) and (2c) are slightly lower than those from model (2a) (by only about 3% in magnitude on average; more obvious for the pensioners dataset), due to the higher selected orders in the former. However, in general there is not much difference in the results between models (2b) and (2c), and also between models (2a) and (2d), which suggest that the additional dependence (via correlation or vector autoregression) is unlikely to have a material impact on assessing longevity basis risk. Besides, though the common factor provides a convenient way to capture the link between the two populations, the common factor model (2) requires using the same data length for both populations in maximum likelihood estimation. In practice, the length of book data is usually shorter, and approximation methods or Bayesian techniques may be adopted to fit the model under the presence of missing data. 3.
The figures from models (3a), (3b), and (3d) of the CBD family tend to be higher than those from models (2a) and (2d) (3a) and (3d), which again suggest that the additional dependence (via correlation) does not have an obvious effect on measuring longevity basis risk. For model (3b), as noted previously, the prediction intervals for κ R t,1 , κ R t,2 , and κ R t,3 based on the VARIMA process are often different to those based on the multivariate random walk with drift in model (3a). Some subsequent effect can be seen in the differences between the figures from models (3a) and (3b), in which the latter ones tend to be lower. 4.
The hedging results estimated from Monte Carlo simulation tend to be higher than those from the bootstrapping approaches (by about 7% in magnitude on average) for the assured lives and pensioners datasets. Since performing Monte Carlo simulation directly on the error terms of the time series processes allows for only process error but not parameter error, there would be an underestimation of longevity basis risk and hence an overestimation of hedge effectiveness. But for the other datasets (New Zealand, Taiwan, and Singapore), there are no such obvious differences, which indicate that the effect of parameter uncertainty is not material for these proxy book data with a large size and more stable population compositions and mortality patterns. In addition, there are no clear, significant differences between the results estimated from the two residuals bootstrapping approaches, in which one involves resampling the reference and book residuals separately and the other groups together the two sets of residuals in each age-time cell as an individual data point for resampling. It means that the additional dependence from linking the residuals does not appear to have any effect on longevity basis risk calculation. Some random differences in the results between the semi-parametric bootstrapping and the residuals bootstrapping can also be seen, but most of them are small and do not show any particular patterns.

5.
For a demonstration of applying the modified semi-parametric bootstrapping approach, we integrate models (1b), (2a), and (3a) which have earlier been shown to produce more reasonable estimates of hedge effectiveness. Following Yang et al. (2015), the optimal model for each pseudo-data sample in the bootstrapping process is selected based on the BIC. That is, the BIC values of fitting the three mortality projection models are compared, and the one with the lowest value is chosen for that pseudo-data. Note that besides a single statistical criterion, a mix of other quantitative and even qualitative criteria may also be used, though the selection rules will then be more complex, and the computation time will lengthen.  (2a) and (1b) share a split of 58% and 42%. Furthermore, as shown in Figure 9 (last row), regarding the Singapore portfolio, the final hedging results from this approach may be perceived as a "weighted" average of those calculated separately from models (1b) and (2a), in which the "weights" are determined by how well the two models are fitted to each of the 5000 pseudo-data samples. 6.
The estimated levels of hedge effectiveness are quite close between using the standard deviation and 99.5% VaR in most cases. This observation may result from the fact that the simulated distributions of the pension plan liability are fairly symmetric and do not have a heavy tail, under all the mortality projection models, time series processes, and simulation methods considered, both before and after hedging. We have also checked the results based on the 95% VaR and the observations are similar. 7.
When the size of the pension plan is reduced to 1000 members, the levels of hedge effectiveness estimated from those coherent models drop to mostly around 20% to 40% (not shown here). By contrast, when the pension plan size is infinite (i.e., the step of using the binomial distribution to simulate the number of lives is omitted), the estimated levels of hedge effectiveness largely rise by about 10% or more in magnitude from the initial setting of 10,000 members. The effect of sampling basis risk is significant, which pinpoints that index-based longevity hedging would be more feasible for either very large pension plans, foundations joined by small pension plans, or reinsurers who have accumulated sizable longevity risk exposures from smaller insurers and pension plans. 8.
There are some differences in the hedging results between using the five datasets. On average, after integrating both process error and parameter error via the bootstrapping approaches, the highest level of hedge effectiveness is given by using the New Zealand dataset, followed by the Taiwan, assured lives, Singapore, and pensioners datasets. It has been mentioned earlier that parameter error would be immaterial for book data with stable population compositions and mortality patterns, which may explain some of the results here. Another possible reason for the greater hedge effectiveness shown by the New Zealand pension plan hedged by an Australia index is the geographical and cultural closeness between Australia and New Zealand. Both neighbors are island nations in the South Pacific and have very close connections historically, socially, and economically, and they would tend to have more concurrent mortality movements than otherwise. 9.
Under models (2a), (2b), (2c), (2d), (3a), (3b), and (3d), the calculated levels of hedge effectiveness of the annual cash flows (not the present value; not shown here) are indeed very low in the early years, but then increase significantly across time, i.e., the dependence between the two populations' mortality levels grows with time in the simulations. For the Li-Lee model types, the common mortality index has increasing variability, while the additional time-varying parameters have bounded variability. Similarly, for the M7-M5 model types, the reference component's time-varying parameters have increasing variability, but the book component's time-varying parameters have bounded variability. The resulting model effect is that the simulated variability of the differences between both populations would have lesser impact gradually. Hence the association between the two populations increases over time (see Figure 8). 10. The weights of the S-forwards are numerically optimized to maximize the hedge effectiveness.
Excluding models (1a), (1d), and (3c) which generate unreasonable simulations, for the assured lives dataset, the weights are roughly about 0.5 (per person) in the first half of the age range, 0.7 in the third quarter, and 0.9 in the last quarter. For the pensioners dataset, the weights are around 0.7 in the first half of the age range and 0.9 in the second half. For the New Zealand, Taiwan, and Singapore datasets, most of the weights are approximately equal to one for the whole range. Interestingly, the major patterns in the weights estimated are not too different between the various models. It should be noted that these weights are subject to the technical limitations of the optimization process, uncertainty of the model choice, and random variations in the simulated samples. In addition, in practice, it would be impossible to enter into such numerically precise amounts of S-forward positions due to liquidity issues. Accordingly, we have also tested the hedging results using the approximate weights noted above, and we realize that the corresponding reductions in the levels of hedge effectiveness are actually quite small, mostly being a few percent in magnitude.

Concluding Remarks
In this paper, we study the impact of model uncertainty on hedging longevity risk via index-based derivatives and measuring longevity basis risk. We adopt three families of mortality projection models, using several time series processes and simulation methods, to calibrate index-based S-forwards, construct longevity hedges, and assess the resulting hedge effectiveness. Notably, we test various combinations of mortality models and time series processes different to those in the existing literature and examine their impact in detail. Overall, the estimated levels of hedge effectiveness are mostly about 50% to 80% for a large pension plan, which is in line with those in Li et al. (2017) using different models and datasets to those adopted here, but the precise levels still depend heavily on the selected model. In particular, while the choice of mortality models has certain effects, it appears that the selection of time series processes has a far greater impact on the index-based hedging results. Unfortunately, the length of the assured lives and pensioners data available is short, and application of more sophisticated time series processes cannot be further investigated. Apart from comparing the results from different models, the semi-parametric bootstrapping approach can also be extended to include selection criteria and test several models simultaneously. However, those models or time series processes with unrestricted simulated variability in the differences between the book and reference populations lead to irrational hedging results and should be avoided. In practice, the final model choice relies not only on a quantitative analysis of (usually limited) past book data but also on some extent of qualitative judgement about uncertain future mortality trends and the user's own knowledge or preference.
There are a few possible areas for future research. So far, we focus on static hedging, under which the hedging tools are not rebalanced in the future. By contrast, dynamic hedging requires constant rebalancing of the hedged portfolio, and it would be interesting to explore the feasibility of dynamic index-based longevity hedging. For simplicity, we also assume that the forward rates are equal to the central estimates, implying a zero premium on longevity risk. To further incorporate the costs of longevity hedging, it would be useful to investigate different risk-neutral methods (e.g., Cairns et al. 2006) for pricing index-based derivatives. Moreover, Li et al. (2017) considered some simple model extensions to take structural mortality changes and mortality jumps into account when assessing the hedge effectiveness. More sophisticated methods such as regime-switching models (e.g., Milidonis et al. 2011) and copulas (e.g., Wang et al. 2015) can further be tested in modeling these extreme events. In addition, we use a single information criterion in the modified semi-parametric bootstrapping process. Alternatively, a more complex set of rules consisting of multiple quantitative and qualitative criteria may provide a more realistic view of model uncertainty. Lastly, Bayesian techniques can also be adopted to allow for model uncertainty directly, instead of using the modified bootstrapping approach above. For instance, two or more competing models can be integrated into the Bayesian framework by setting a prior distribution for all the model candidates (e.g., Cairns 2000).