Bayesian Mixture Modelling for Mortality Projection

: Although a large number of mortality projection models have been proposed in the literature, relatively little attention has been paid to a formal assessment of the effect of model uncertainty. In this paper, we construct a Bayesian framework for embedding more than one mortality projection model and utilise the ﬁnite mixture model concept to allow for the blending of model structures. Under this framework, the varying features of different model structures can be exploited jointly and coherently to have a more detailed description of the underlying mortality patterns. We show that the proposed Bayesian approach performs well in ﬁtting and forecasting Japanese mortality.


Introduction
In the demographic, actuarial, and insurance literature, the two main branches of mortality projection models are the Lee and Carter (1992) model family and the Cairns et al. (2006) (CBD) model family. There have been a large number of their variations, extensions, and applications to date (e.g., Lee 2000;Cairns et al. 2009;Haberman and Renshaw 2011). In most of these works, the selection of mortality projection models is mainly based on certain standard statistical criteria, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), mean square error (MSE), and mean absolute percentage error (MAPE). Once the "best model" is determined, the common practice is to simply apply the selected model singly to the problem under consideration (e.g., demographic projection, longevity risk pricing and hedging, capital assessment). While this approach is reasonably sound in its own right, it ignores the fact that there is still much uncertainty in actually how close the chosen model is to the true underlying mechanism and that the other suboptimal models may still capture useful aspects that are omitted by the selected model. It would be a serious waste if the other model candidates are simply discarded after putting in all the effort on fitting them to the data and only ranking them by a pre-specified, sometimes subjective, criterion.
There has been relatively little attention on a more proper assessment of the effect of model uncertainty. In this paper, we develop a fully Bayesian framework for incorporating multiple mortality projection models, including the Lee-Carter model, CBD model (with curvature), and age-cohort model. These models would depict different aspects of the data, and it would be useful to integrate them into a coherent framework. In particular, we employ the finite mixture model concept (e.g., Marin et al. 2005;Frühwirth-Schnatter et al. 2018) to estimate the probability of a model candidate and also blend the model structures in a formal manner. More specifically, the overall density function is expressed as a weighted average of the density functions of the individual components. Another perspective of using a finite mixture model is to take it as a semi-parametric approach that provides a more flexible way for handling complex data patterns, in contrast to simply applying a single model with limited features. Under such a framework, the distinct features of different model candidates can lead to a more thorough portrayal of the underlying mortality patterns. In effect, all of the inherent volatility, parameter uncertainty, and model uncertainty are allowed for simultaneously. Comparatively, previous studies in mortality projection generally took care of the first one or two uncertainties (via Monte Carlo simulation or bootstrapping) only. Moreover, the Bayesian approach allows joint estimation of the mortality structure and the time series structure, which avoids the potential estimation bias under the usual two-step procedure in fitting mortality projection models. Using Japanese mortality data, we show that the proposed Bayesian approach can give a better account of the uncertainty in model selection.
The other advantages of Bayesian modelling, in general, include the ability to cope with missing values and the room to embed relevant information into prior distributions. Earlier Bayesian mortality modelling work can be found in Czado et al. (2005), Pedroza (2006), Kogure et al. (2009), Cairns et al. (2011), Li (2014a, 2014b, and Van Berkum et al. (2017). Some applications in pricing or hedging longevity risk have been demonstrated by (Kogure and Kurachi 2010;Kogure et al. 2014), Cairns (2013, and Li et al. (2019).
The remainder of the paper is as follows. Section 2 introduces our Bayesian framework, which integrates the Lee-Carter model, the CBD model, and their time series processes, and presents the numerical results based on Japanese male mortality data. Section 3 sets out another Bayesian framework incorporating the Lee-Carter model and the age-cohort model and discusses the application results. Section 4 gives the concluding remarks.

Bayesian Lee-Carter with CBD
Let X be the variables to be observed, X f be the future values of the variables, Θ be the unknown parameters, and W be the unknown weights of the model candidates. From a Bayesian perspective, the main objective is to deduce the joint posterior distribution p(X f , Θ, W|X), based on which different inferences can be drawn. The parameter estimates can be taken as the mean (or median) of the posterior distribution p(Θ|X). The model candidate with the largest mean of p(W|X) can be regarded as the most probable or optimal model. The predictive distribution of the future values is then derived as p(X f |X). Within the Bayesian framework, the finite mixture model structure can be specified as f (x|Θ, W) = ∑ w i f i (x|θ i ), where f denotes a density function, and the subscript i refers to the ith model component. This mixture becomes particularly useful when the underlying data patterns are complicated and cannot be described sufficiently by one single model. In such a mixture, each model component would not just "compete" in making its own contributions to the overall description of the data features but also supplement the other model components by filling in their gaps.
Suppose m(x, t) is the central death rate at age x in year t. The first part is the Lee-Carter structure: in which α(x) is the age effect, and κ(t) is the period effect with age-specific sensitivity β(x). The two identifiability constraints are ∑ x β(x) = 1 and ∑ t κ(t) = 0. The period effect is then modelled as a random walk with drift κ(t) = κ(t − 1) + d + e(t), where d is the drift term and e(t) ∼ Normal(0, σ 2 e ) is the error term. The second part is the CBD structure with curvature: where for year t, κ (1) (t) is the level of the mortality curve, κ (2) (t) refers to the slope, κ (3) (t) is related to the curvature, x is the average age, and σ 2 is the average value of (x − x) 2 . The three time-varying parameters are then modelled as a three-dimensional random walk with drift K(t) = K(t − 1) + Θ + Ψ(t), in which Θ is the drift vector, and Ψ(t) is the multivariate Risks 2021, 9, 76 3 of 12 normal error vector with mean 0 and covariance matrix Ω. The overall model's density is then specified as below: in which y = ln m(x, t), w 1 and w 2 = 1 − w 1 are the weights of the two components, σ 1 and σ 2 are the standard deviations of the two components, and φ is the standard normal density. It can be deduced that E(Y|Θ, W) = w 1 η 1 + w 2 η 2 and E(Y 2 |Θ, W) = w 1 (σ 2 1 + η 2 1 ) + w 2 (σ 2 2 + η 2 2 ). The prior distributions of all the parameters above are assumed as follows: The hyperparameters are chosen in such a way that the priors are as uninformative as possible. The uniform distribution of w 1 means that there is no preference over any of the two model structures at the start, assuming that there is no relevant prior information on the choice between them. The terms a 1 and a 2 are set to be 2.1, and b 1 and b 2 are set as 1.1 times the corresponding residual variances. The prior variances σ 2 α and σ 2 β are set as 10 times the sample variances ofα(x)'s andβ(x)'s (estimated via singular value decomposition) over age. The drift term mean d 0 and variance σ 2 d are computed from the estimated decrementsκ(t) −κ(t − 1). The term a e is assumed to be 2.1 and b e is taken as 1.1 times the sample variance ofκ(t) −κ(t − 1) over time. The drift mean vector Θ 0 and covariance matrix Ξ are estimated from the trivariate decrementsK(t) −K(t − 1). The term R is taken as n times the sample covariance matrix ofK(t) −K(t − 1), and n is set equal to 4. More details can be found in Kogure et al. (2009) andLi (2014b).
Since it is mathematically intractable to derive a closed-form solution for the joint posterior distribution, we use Markov chain Monte Carlo (MCMC) simulation to provide a random sample from the posterior, in which simulated samples are generated from a Markov chain having its stationary distribution equal to the posterior distribution (Spiegelhalter et al. 2003). We apply the software JAGS (Plummer 2017) to implement the MCMC simulations. It involves the Gibbs sampling method, which simulates from the fully conditional posterior distribution of each variable in sequence. The JAGS programming language operates on the R platform as a package. Note that while one may assume priors that are even more diffuse than those listed above, the computation time would then lengthen substantially, and the program may even crash.
For each run of the MCMC simulation, the initial 5000 iterations are discarded to remove the effect of the initial values, and afterwards, 1000 iterations are obtained. The collected 1000 scenarios are then used for estimating the parameters and probability intervals. As illustrated in the two examples in Figure 1, the resulting autocorrelations (bottom right) in the successive samples of each variable are negligible, which is a strong sign of proper convergence to the stationary distribution. Moreover, the Monte Carlo errors are largely around 5% or smaller of the sample standard deviations, providing further evidence on the convergence of the MCMC samples to the stationary distribution.
remove the effect of the initial values, and afterwards, 1000 iterations are obtained. The collected 1000 scenarios are then used for estimating the parameters and probability intervals. As illustrated in the two examples in Figure 1, the resulting autocorrelations (bottom right) in the successive samples of each variable are negligible, which is a strong sign of proper convergence to the stationary distribution. Moreover, the Monte Carlo errors are largely around 5% or smaller of the sample standard deviations, providing further evidence on the convergence of the MCMC samples to the stationary distribution. We have collected Japanese male mortality data of ages 60 to 89 and years 1970 to 2017 from the Human Mortality Database (HMD 2020) and fit the Lee-Carter model (only), the CBD model (only), and the mixture model to the data under the Bayesian framework as described above. During the data period, the mortality trends and patterns tend to be more complex for males than for females, so it would be interesting to apply the mixture model to the male data. Figure 2 compares the posterior means of ( )

and
(3) ( ) t κ from the Lee-Carter model alone or the CBD model alone against those from the mixture model. It can be seen that the parameter estimates of each structure have their basic shapes rather preserved under the mixture model. We have collected Japanese male mortality data of ages 60 to 89 and years 1970 to 2017 from the Human Mortality Database (Human Mortality Database HMD) and fit the Lee-Carter model (only), the CBD model (only), and the mixture model to the data under the Bayesian framework as described above. During the data period, the mortality trends and patterns tend to be more complex for males than for females, so it would be interesting to apply the mixture model to the male data. Figure 2 compares the posterior means of α(x), β(x), κ(t), κ (1) (t), κ (2) (t), and κ (3) (t) from the Lee-Carter model alone or the CBD model alone against those from the mixture model. It can be seen that the parameter estimates of each structure have their basic shapes rather preserved under the mixture model.  0224. This value suggests that the CBD structure is preferred over the Lee-Carter structure, as it is significantly different from 0.5 (with p-value of 0.01). However, it is still significantly larger than 0, which means that both structures, when used jointly, can potentially complement each other in describing the data. The deviance information criterion (DIC) is defined as the posterior mean of the deviance plus the effective number of parameters. The DIC values of applying the Lee-Carter, CBD, and mixture models are calculated as 20,552, 20,254, and 19,513, respectively. These statistics are in line with the two implications above-while the CBD model (alone) can be selected over the Lee-Carter model (alone), the mixture model actually gives the lowest DIC and is the optimal one amongst the three cases. The blending of the Lee-Carter and CBD structures clearly leads The posterior mean of w 1 in the Bayesian mixture model is 0.4398 with a standard deviation of 0.0224. This value suggests that the CBD structure is preferred over the Lee-Carter structure, as it is significantly different from 0.5 (with p-value of 0.01). However, it is still significantly larger than 0, which means that both structures, when used jointly, can potentially complement each other in describing the data. The deviance information criterion (DIC) is defined as the posterior mean of the deviance plus the effective number of parameters. The DIC values of applying the Lee-Carter, CBD, and mixture models are Risks 2021, 9, 76 5 of 12 calculated as 20,552, 20,254, and 19,513, respectively. These statistics are in line with the two implications above-while the CBD model (alone) can be selected over the Lee-Carter model (alone), the mixture model actually gives the lowest DIC and is the optimal one amongst the three cases. The blending of the Lee-Carter and CBD structures clearly leads to an improvement in the fitting performance. It is also reflected in the residual graphs in Figure 3. There are distinctive patterns in the standardised residuals by cohort under the Lee-Carter model (alone) and by both age and cohort (to a lesser extent) under the CBD model (alone). These patterns are largely removed with the blending of the two structures (except for the more recent cohorts, as both structures have the same issue for that aspect of the data).
The posterior mean of 1 w in the Bayesian mixture model is 0.4398 with a standard deviation of 0.0224. This value suggests that the CBD structure is preferred over the Lee-Carter structure, as it is significantly different from 0.5 (with p-value of 0.01). However, it is still significantly larger than 0, which means that both structures, when used jointly, can potentially complement each other in describing the data. The deviance information criterion (DIC) is defined as the posterior mean of the deviance plus the effective number of parameters. The DIC values of applying the Lee-Carter, CBD, and mixture models are calculated as 20,552, 20,254, and 19,513, respectively. These statistics are in line with the two implications above-while the CBD model (alone) can be selected over the Lee-Carter model (alone), the mixture model actually gives the lowest DIC and is the optimal one amongst the three cases. The blending of the Lee-Carter and CBD structures clearly leads to an improvement in the fitting performance. It is also reflected in the residual graphs in Figure 3. There are distinctive patterns in the standardised residuals by cohort under the Lee-Carter model (alone) and by both age and cohort (to a lesser extent) under the CBD model (alone). These patterns are largely removed with the blending of the two structures (except for the more recent cohorts, as both structures have the same issue for that aspect of the data).   Figure 4 plots the log death rates at ages 60, 70, and 80 and their projected values (predictive means) to the year 2050 with 95% prediction intervals. It is interesting to observe that the mixture model generates the widest prediction intervals compared to the two single models. It can be regarded as a direct consequence of incorporating model uncertainty into the Bayesian modelling and simulations. Note that a proper allowance for all kinds of uncertainties is of critical importance in capital assessment and pricing and hedging of longevity risk (e.g., Haberman et al. 2014 andLi et al. 2017). In particular, the real impact of model uncertainty is often omitted in previous studies on longevity risk. Failure to take model uncertainty into full account could lead to a serious underestimation  Figure 4 plots the log death rates at ages 60, 70, and 80 and their projected values (predictive means) to the year 2050 with 95% prediction intervals. It is interesting to observe that the mixture model generates the widest prediction intervals compared to the two single models. It can be regarded as a direct consequence of incorporating model uncertainty into the Bayesian modelling and simulations. Note that a proper allowance for all kinds of uncertainties is of critical importance in capital assessment and pricing and Risks 2021, 9, 76 6 of 12 hedging of longevity risk (e.g., Haberman et al. 2014 andLi et al. 2017). In particular, the real impact of model uncertainty is often omitted in previous studies on longevity risk. Failure to take model uncertainty into full account could lead to a serious underestimation when valuing pensions and annuities and evaluating their risk allowances.  Figure 4 plots the log death rates at ages 60, 70, and 80 and their projected values (predictive means) to the year 2050 with 95% prediction intervals. It is interesting to observe that the mixture model generates the widest prediction intervals compared to the two single models. It can be regarded as a direct consequence of incorporating model uncertainty into the Bayesian modelling and simulations. Note that a proper allowance for all kinds of uncertainties is of critical importance in capital assessment and pricing and hedging of longevity risk (e.g., Haberman et al. 2014 andLi et al. 2017). In particular, the real impact of model uncertainty is often omitted in previous studies on longevity risk. Failure to take model uncertainty into full account could lead to a serious underestimation when valuing pensions and annuities and evaluating their risk allowances.

Bayesian Lee-Carter with Age-Cohort
Starting arguably from Renshaw and Haberman (2006), the cohort effect has become an important consideration in building a mortality projection model. It refers to unique mortality patterns that are found amongst only those individuals who were born in a given year. However, some authors (Cairns et al. 2009;Beutner et al. 2017) discussed the problems of slow convergence and lack of robustness and the identifiability issues when incorporating the cohort effect together with the period effect. Some possible solutions so far include using an alternative model structure (Haberman and Renshaw 2011) and different identifiability constraints (Hunt and Villegas 2015).
In this section, we explore a different treatment of the cohort effect in order to avoid the identifiability issues. In particular, the second part of the Bayesian framework in Section 2 is now replaced with the following age-cohort structure (Renshaw and Haberman 2006): * * * *

Bayesian Lee-Carter with Age-Cohort
Starting arguably from Renshaw and Haberman (2006), the cohort effect has become an important consideration in building a mortality projection model. It refers to unique mortality patterns that are found amongst only those individuals who were born in a given year. However, some authors (Cairns et al. 2009;Beutner et al. 2017) discussed the problems of slow convergence and lack of robustness and the identifiability issues when incorporating the cohort effect together with the period effect. Some possible solutions so far include using an alternative model structure (Haberman and Renshaw 2011) and different identifiability constraints (Hunt and Villegas 2015).
In this section, we explore a different treatment of the cohort effect in order to avoid the identifiability issues. In particular, the second part of the Bayesian framework in Section 2 is now replaced with the following age-cohort structure (Renshaw and Haberman 2006): where α * (x) is the age effect, and γ * (t − x) is the cohort effect with age-specific sensitivity β * (x). The two constraints required are ∑ x β * (x) = 1 and ∑ c γ * (c) = 0. The cohort effect is then modelled as a random walk with drift γ * (t) = γ * (t − 1) + d * + e * (t), in which d * is the drift term, and e * (t) ∼ Normal(0, σ * 2 e ) is the error term. The corresponding prior distributions are given below: The hyperparameters are selected in the same way as in the previous section. Note that the Lee-Carter structure η 1 would fail to capture all the cohort effect, while this agecohort structure η * 2 would fail to capture all the period effect. They can then serve as a complement to each other, and their combination would take both the period and cohort effects into full account. Figure 5 plots the posterior means of the parameters from the Lee-Carter model (alone), the age-cohort model (alone), and the mixture model. Again, the parameter estimates are quite similar between the single models and the mixture model. The posterior mean of w 1 in the mixture model is 0.4705 with a standard deviation of 0.0205. This time, the value of w 1 is not significantly different from 0.5 (with p-value of 0.15), which means that it is rather hard to select between the Lee-Carter and age-cohort structures. The DIC values of fitting the Lee-Carter, age-cohort, and mixture models are computed as 20,552, 20,644, and 19,709. The mixture model here (Lee-Carter with age-cohort) still delivers the lowest DIC amongst the three cases and so provides a better fitting performance than either of the two single models on its own, though it is slightly not as good as the one (Lee-Carter with CBD) in Section 2. Figure 6 compares the standardised residuals between the three cases. There are ripple patterns by cohort under the Lee-Carter model and by calendar year under the age-cohort model, which are understandable considering how the two single models allow for the period and cohort effects, respectively. Under the mixing of the two structures, by contrast, the residuals look much more randomly scattered, which means that the mixture model provides a better fit to the data. The hyperparameters are selected in the same way as in the previous section. Note that the Lee-Carter structure 1 η would fail to capture all the cohort effect, while this agecohort structure 2 η * would fail to capture all the period effect. They can then serve as a complement to each other, and their combination would take both the period and cohort effects into full account. Figure 5 plots the posterior means of the parameters from the Lee-Carter model (alone), the age-cohort model (alone), and the mixture model. Again, the parameter estimates are quite similar between the single models and the mixture model. The posterior mean of 1 w in the mixture model is 0.4705 with a standard deviation of 0.0205. This time, the value of 1 w is not significantly different from 0.5 (with p-value of 0.15), which means that it is rather hard to select between the Lee-Carter and age-cohort structures. The DIC values of fitting the Lee-Carter, age-cohort, and mixture models are computed as 20,552, 20,644, and 19,709. The mixture model here (Lee-Carter with age-cohort) still delivers the lowest DIC amongst the three cases and so provides a better fitting performance than either of the two single models on its own, though it is slightly not as good as the one (Lee-Carter with CBD) in Section 2. Figure 6 compares the standardised residuals between the three cases. There are ripple patterns by cohort under the Lee-Carter model and by calendar year under the age-cohort model, which are understandable considering how the two single models allow for the period and cohort effects, respectively. Under the mixing of the two structures, by contrast, the residuals look much more randomly scattered, which means that the mixture model provides a better fit to the data.    Figure 7 shows the corresponding log death rates at different ages with their projected values and prediction intervals. The mixture model here clearly produces much wider prediction intervals than those from the single age-cohort model. Again, these differences highlight the importance of allowing for model uncertainty adequately, or else there would be a nontrivial possibility of significantly underestimating the extent of longevity risk, leading to serious future financial losses for an annuity provider or a pension plan sponsor.  Figure 7 shows the corresponding log death rates at different ages with their projected values and prediction intervals. The mixture model here clearly produces much wider prediction intervals than those from the single age-cohort model. Again, these differences highlight the importance of allowing for model uncertainty adequately, or else there would be a nontrivial possibility of significantly underestimating the extent of longevity risk, leading to serious future financial losses for an annuity provider or a pension plan sponsor. We then conduct an out-of-sample analysis and divide the data into two periods: 1970-1999 for fitting the models and 2000-2017 for assessing the forecast accuracy. For the Bayesian Lee-Carter, CBD, and age-cohort (single) models, and the two Bayesian mixture models, the mean absolute errors (MAE) of the projected log death rates over 2000-2017 are calculated as 0.0382, 0.0462, 0.1284, 0.0439, and 0.0363, respectively, as displayed in Table 1. It means that the Lee-Carter with age-cohort mixture model gives the best forecast accuracy while the single age-cohort model gives the worst performance. The corresponding mean square errors (MSE) are also in line with the MAE results. Figure 8 demonstrates the observed and simulated log death rates at different ages under these two models. As shown, the single age-cohort model fails to predict the mortality trends when compared with the Lee-Carter with age-cohort mixture model. For the former, not just the projected trends deviate much from the observed trends, but also the prediction intervals fail to capture the observed trends mostly. For the latter, by contrast, the observed trends lie well within the wider prediction intervals generally. We then conduct an out-of-sample analysis and divide the data into two periods: 1970-1999 for fitting the models and 2000-2017 for assessing the forecast accuracy. For the Bayesian Lee-Carter, CBD, and age-cohort (single) models, and the two Bayesian mixture models, the mean absolute errors (MAE) of the projected log death rates over 2000-2017 are calculated as 0.0382, 0.0462, 0.1284, 0.0439, and 0.0363, respectively, as displayed in Table 1. It means that the Lee-Carter with age-cohort mixture model gives the best forecast accuracy while the single age-cohort model gives the worst performance. The corresponding mean square errors (MSE) are also in line with the MAE results. Figure 8 demonstrates the observed and simulated log death rates at different ages under these two models. As shown, the single age-cohort model fails to predict the mortality trends when compared with the Lee-Carter with age-cohort mixture model. For the former, not just the projected trends deviate much from the observed trends, but also the prediction intervals fail to capture the observed trends mostly. For the latter, by contrast, the observed trends lie well within the wider prediction intervals generally.

Concluding Remarks
In this paper, we devise a Bayesian framework for integrating multiple mortality projection model structures via the setting of a finite mixture model. In this framework, the different characteristics of the model components involved can be exploited in a joint and coherent manner in order to enhance the capacity in modelling complex mortality patterns. We demonstrate that the proposed Bayesian mixture modelling approach generates superior fitting and forecasting performances on Japanese mortality data when compared to the mere use of a single model. The new approach would lead to a more sufficient risk allowance in pension and annuity valuation. As the precise impact of model uncertainty is often overlooked in previous mortality projection studies, it is intended that this work would fill in some of the gaps in the current literature.
There are a few areas that would worth further research. First, three or more model structures can be combined altogether as a finite mixture model. It would be interesting to blend the Lee-Carter, CBD, age-cohort, and potentially many other models and test such a comprehensive mixture on various countries' data; however, the computation time would then increase substantially as more structures are included. Second, the proposed approach can be extended to multi-population cases, such as both sexes, different socioeconomic classes or states in a country, and neighbouring countries. If one wants to ensure mortality coherence between the subpopulations, each of the model components in the mixture must possess the mortality coherence property itself.

Concluding Remarks
In this paper, we devise a Bayesian framework for integrating multiple mortality projection model structures via the setting of a finite mixture model. In this framework, the different characteristics of the model components involved can be exploited in a joint and coherent manner in order to enhance the capacity in modelling complex mortality patterns. We demonstrate that the proposed Bayesian mixture modelling approach generates superior fitting and forecasting performances on Japanese mortality data when compared to the mere use of a single model. The new approach would lead to a more sufficient risk allowance in pension and annuity valuation. As the precise impact of model uncertainty is often overlooked in previous mortality projection studies, it is intended that this work would fill in some of the gaps in the current literature.
There are a few areas that would worth further research. First, three or more model structures can be combined altogether as a finite mixture model. It would be interesting to blend the Lee-Carter, CBD, age-cohort, and potentially many other models and test such a comprehensive mixture on various countries' data; however, the computation time would then increase substantially as more structures are included. Second, the proposed approach can be extended to multi-population cases, such as both sexes, different socioeconomic classes or states in a country, and neighbouring countries. If one wants to ensure mortality coherence between the subpopulations, each of the model components in the mixture must possess the mortality coherence property itself.