Mortality Forecasting with an Age-Coherent Sparse VAR Model

: This paper proposes an age-coherent sparse Vector Autoregression mortality model, which combines the appealing features of existing VAR-based mortality models, to forecast future mortality rates. In particular, the proposed model utilizes a data-driven method to determine the autoregressive coefﬁcient matrix, and then employs a rotation algorithm in the projection phase to generate age-coherent mortality forecasts. In the estimation phase, the age-speciﬁc mortality improvement rates are ﬁtted to a VAR model with dimension reduction algorithms such as the elastic net. In the projection phase, the projected mortality improvement rates are assumed to follow a short-term ﬂuctuation component and a long-term force of decay, and will eventually converge to an age-invariant mean in expectation. The age-invariance of the long-term mean guarantees age-coherent mortality projections. The proposed model is generalized to multi-population context in a computationally efﬁcient manner. Using single-age, uni-sex mortality data of the UK and France, we show that the proposed model is able to generate more reasonable long-term projections, as well as more accurate short-term out-of-sample forecasts than popular existing mortality models under various settings. Therefore, the proposed model is expected to be an appealing alternative to existing mortality models in insurance and demographic analyses.


Introduction
The ongoing improvement of human life expectancy around the world has made longevity risk, the risk that people live longer than expected, an increasingly important risk for many demographic, economic, and insurance practices. This has urged the academia to better understand the driving factors of mortality improvements, and design effective statistical tools to provide accurate and credible mortality projections.
The past few decades have witnessed rapid developments of mortality forecasting research. Among existing methods, one popular class of model is the so called factor-based models. These models use factor representations to summarize mortality patterns of high dimensional data sets by one or few factors. Mortality projections are then obtained by extrapolating the estimated factors. To date, the most widely used factor-based models include the Lee-Carter model (Lee and Carter 1992) and the Cairns-Blake-Dowd (Cairns et al. 2006) model. Many extensions of these two models have been proposed in the literature (see, among many others, Booth et al. 2006;Li et al. 2015;Renshaw and Haberman 2006). While the Lee-Carter and the CBD models have been demonstrated to be effective in mortality forecasting, they are not able to explicitly model mortality correlations between different populations, as only one population can be fitted at a time. To address the modeling of joint mortality dynamics, Li and Lee (Li and Lee 2005) extend the Lee-Carter model to a multi-population context, in which mortality rates of all populations are assumed to follow a common systematic trend. In particular, mortality projections generated by the Li-Lee model is population-coherent, i.e., the projected mortality rates of different populations population context. We find that, due to the age-coherent feature, the CSVAR model generates smoother long-term projection of age-specific mortality rates. In particular, the projected mortality improvements are more similar across ages-they are more pronounced for the old ages and less substantial for the very young ages compared to those produced by the Lee-Carter model and the SVAR model. As a result, the proposed CSVAR generates higher point forecasts of life expectancy at birth than the Lee-Carter model and the STAR model at all forecast horizons. Moreover, in the out-of-sample forecast analysis, the proposed CSVAR model is able to produce more accurate out-of-sample forecasting results than the Lee-Carter model (and the Li-Lee model in the multi-population case) and the SVAR model with different choices of sample size and age groups. Furthermore, the proposed CSVAR model produces projected life expectancy closer to the realized values over the out-of-sample forecast period in both populations than the Lee-Carter model and the STAR model. Therefore, it seems that the proposed CSVAR model is able to generate both the long-term forecasts and the short-term out-of-sample forecasts, and could therefore be a appealing alternative to the existing models in life insurance and demographic analysis.
The remainder of the paper is organized as follows. Section 2 reviews the Lee-Carter model and the Li-Lee model. Section 3 reviews the existing VAR-based mortality models, especially the SVAR model, and introduces the age-coherent CSVAR model. Section 4 discusses the empirical analysis. Section 5 concludes the paper.

The Factor-Based Model
Suppose we have mortality data of N ages, each age with T years of observations. The Lee-Carter (LC) model (Lee and Carter 1992) summarizes the systematic mortality trends of the N ages by a common factor. Formally, the log central mortality rate at age x in year t, y x,t , follows the specification given by: where a x is the mortality level, i.e., the average mortality rate over time at age x, k t is the period effect, i.e., the systematic mortality trend common to all ages, b x is the age effect at x, i.e., the sensitivity of y x,t to k t , and ε x,t is the normal residual term with mean 0 and variance σ 2 ε x . As noted by Lee and Carter (1992), Equation (1) is not identifiable without normalization constraints. For example, one could multiply b x with a constant c and divide k t by the same constant, and reach the same fitting results. In Lee and Carter (1992), the following normalization constraints are imposed: Given the first constraint, a x is set to the mean of y x,t over the sample considered. The Lee-Carter model is then estimated by singular value decomposition (SVD) instead of the usual ordinary least square approach in the original paper 1 . Given the estimated a x and b x , k t is adjusted to match the fitted total number of deaths to the observed values in each year t. This adjustment rebalances the contribution of age-specific mortality by assigning greater weights to ages with a larger number of death.
While a x and b x are assumed to be constant, the period effect is often modelled by a time-series process. In particular, Lee and Carter (1992) assumes a random walk with drift specification, which is adopted by many later studies: where the drift term d measures the average annual change in k t , and e t i.i.d.
∼ N(0, σ 2 e ). Based on the time-series specification in Equation (3), future mortality rates can be projected by extrapolating the period effect k t . Specifically, the expected h-step-ahead mean forecasts of the period effect and the log central death rate are given by: where T is last year of the sample. The Lee-Carter model is a single-population model, i.e., it focuses on the mortality of one population at a time. This constraint is later relaxed by the Li-Lee model (Li and Lee 2005), which incorporates mortality dynamics of multiple populations simultaneously. The joint modeling of mortality dynamics of multiple populations are important not only in demographic analysis ), but also for various insurance practices, including the risk management of insurance policies for small portfolios and the pricing of the innovative index-based longevity-linked securities and retirement products (Chen et al. 2020;Li 2018;Lu 2018, 2019). In the Li-Lee model, the log central death rates are represented by a common factor and a population-specific factor. Suppose there are I populations in total, the log central death rate for the i-th population at age x and year t, y x,t,i , follows the specification given by: where a x,i is the average mortality level at age x in the ith population, B x and K t represent the common age effect and period effect, k t,i is the population-specific period effect with respect to the ith population, and b x,i is the corresponding population-specific age effect. Finally, ε x,t,i is the normal population-specific error term with mean 0 and variance σ 2 ε x,i . Similarly, a set of normalization constraints are imposed to ensure identifiability: Similar to the Lee-Carter model, the common period effect K t can be modelled as a random walk with drift process. On the other hand, the population-specific period effects k t,i , i = 1, ..., I are fitted by stationary autoregressive processes to ensure coherent forecasts in the long term. Specifically, the time-series specifications of the period effects are given by: where a 0,i and |a 1,i | < 1 are the autoregressive parameters and e t,i is the Gaussian error term with mean 0 and variance σ 2 e i . Stationarity of the k i,t processes guarantees that deviations of the projected mortality of each population from each other will not grow infinitely in the long run. Formally, the projected mortality rates are coherent for different populations in the long-run ifŷ x,T+h,i −ŷ x,T+h,j = O p (1) for i = j. We see that this condition is indeed satisfied if all k i,t s are stationary, as in this case the long-term mortality trends of all ages and populations are driven by the single common period effect K t , and k i,t s only represent the short-term fluctuations around the common mortality trend. Therefore, the Li-Lee model is indeed able to generate coherent mortality projections across populations. However, as noted by recent studies (for example, Li and Lu 2017), the coherent property only holds for projected mortality rates of the same age. Specifically, for an arbitrary different age z = x, the Li-Lee model will not lead toŷ x,T+h,i −ŷ z,T+h,j = O p (1) for any i and j. Therefore, the mortality projections generated by the Li-Lee model is not age-coherent.

The Vector-Autogression-Based Models
As an alternative to factor-based models, VAR-based mortality models have been a rapidly emerging class of mortality models in recent literature. Compared to the factorbased models, the VAR-based are more flexible and thus able to capture more complex time, age, and cohort mortality dependence (Guibert et al. 2019;Li and Lu 2017). However, the application of VAR-based models in mortality forecasting leads to new challenge. In particular, unconstrained VAR models typically have a larger number of parameters to be estimated, while observations in mortality data are often limited. For example, suppose we have N age groups and only one population is considered. Even in the simplest VAR(1) case, if the first order lag of all the N ages are included, then the total number of parameters will be p = N(N + 1) (including N intercepts), while the total number of observations is NT. In mortality analysis, we typically have only a few decades of annual data, while the number of ages could be above 100. Therefore, there are typically more unknown parameters than observations in the standard VAR framework without any parameter constraints.

The Sparse VAR Model
A recent effort to address the curse of dimenisonality issue in the VAR mortality models is made by Guibert et al. (2019), who employ a sparse VAR (SVAR) model on mortality improvements. First, they model the dynamics of the mortality improvement rates, y x,t = y x,t − y x,t−1 , rather than the mortality rates themselves. Under the assumption that (y x,t ) t is an I(1) process for all ages, the dependent variables ( y x,t ) t are I(0) and therefore stationary. As a result, standard VAR models can be used, without the need to consider co-integration relations within mortality data. Second, using an elastic-net (ENET) penalty estimation, the SVAR model adopts a pure data-driven method to select the non-zero coefficients in the estimation process. In this way the coefficient matrix will be sparse, i.e., the majority of autoregressive coefficients will be set to zero. Formally, when only the first order lage is considered, the mortality improvement model is given by: where Y t = ( y 1,t , y 2,t , ..., y N,t ) is the N × 1 vector of mortality improvement rates. ε t is assumed to follow a multi-Gaussian distribution with mean 0. M is an N × 1 vector estimated by the sample averages of Y t . B is the coefficient matrix, and the sparsity (frequency and location of zeros) of B is determined by the LASSO (L1) penalty during the estimation process without any constraints. More specifically, the objective function to be minimized is given by: where β i,j is the ith row jth column element of B, and λ is the L1 penalty parameter. A larger value of λ will shrink more β i,j to be exactly 0. The selection of λ is performed via the cross validation with ten-fold, and the estimates of parameters are then obtained accordingly. Details can be found in Friedman et al. (2010). After estimating Model (8), the forecasting is performed as follows: with h > 1. As a result, it holds thatŶ T+h = Y T +∑ h l=1 Ŷ T+l . In a I-population case, we may let Y t be an N I × 1 vector given by ( y 1,t,1 , ..., y N,t,I ) , where y x,t,i is the mortality improvement rate of the i-th population at age x and year t. Other variables and parameters of Equation (8) can be redefined accordingly in a straightforward manner. In all cases, the sparsity of B depends on the tuning parameter λ in the estimation process, which is derived via the usual cross-validation procedure. Given Equations (8)-(10), we see thatŷ x,T+h −ŷ x+k,T+h = O p (1) + O p (h)(m x −m x+k ) for two arbitrary ages x and x + k in the SVAR model, because y x,t is stationary for any x. As a result, without any constraints onm x andm x+k , the projected mortality rates of any two ages will be diverging in the long-run, and thus the projection generated by the SVAR model is not age-coherent either.

The Coherent Sparse VAR Model
In this paper we propose a age-coherent extension of the SVAR model introduced in Section 3.1 by generalizing the intercepts in the projection, i.e.,M in Equation (10), will be extended to a series of stationary time-dependent processes. Specifically, the x-th element ofM,m x , will be generalized to a process (m x,h ) h which varies over the projection horizon and converges in expectation to a constant universal to all ages, and therefore the projected mortality rates of all ages will be non-diverging in the long-run. To this end, many stationary processes can be chosen. In this paper, we illustrate the proposed method using the hyperbolic decay process. In time-series analysis, the hyperbolic decay is related to the concept of long memory (Feng and Shi 2017;Gao et al. 2020;Ho and Shi 2020), representing the type of decay with speed slower than that of the short memory ones, such as the geometric or exponential decay processes (Hosking 1981). An application of the hyperbolic decay in the mortality modelling and forecasting can be found in Feng et al. (2020). Formally, under the assumption of hyperbolic decay, the intercept used in the h−step ahead forecast,m x,h , is given by: wherem x is defined in the same way as in Equation (8),m * is long-term mean ofm x for all xs, and the hyperbolic parameter δ h (d x ) is defined as: When the hyperbolic parameter d x falls between 0 and 1, it holds that δ h (d x ) → 0 when h → ∞, andm x,h will eventually converge (decay) tom * . Furthermore, the speed of decay is slower (resp. faster) for larger (resp. smaller) values of d x . The extension from a constantM to the hyperbolic decay processes of intercepts then leads to age-coherent projection of mortality rates. More specifically, for any two ages x and x + k, the distance between the projected mortality rates h steps ahead is given by: which will stay bounded in the long run. In this paper, we letm * be the sample mean of all m x , and refer to the model with the time-varying intercepts given in Equation (11) as the coherent SVAR (CSVAR) model.
Despite the aforementioned desirable feature of age-coherent forecasts, the determination of d x is a non-trivial issue in the CSVAR model. There are two major challenges. First, without parameter restrictions, the introduction of d x could increase the number of parameters by N (d 1 , ..., d N ). To address this issue, an appropriate dimensionality reduction technique is required. In particular, as argued in Li and Lu (2017), mortality changes of neighboring ages are typically rather smooth. Therefore, it is reasonable to assume that d x is a smooth function of x for each age to cope with the empirical patterns. Furthermore, it is argued in existing studies that mortality declined will be lower at older ages (see, for example, Li et al. 2013). This suggests that appropriate functional forms should be imposed such that d x is smaller for larger xs, and thus the correspondingm x,h s will converge to m * more slowly. Second, as the parameters d x s only play a role in the projection phase, there is no data to identify the optimal parametric structure of d x (over x). In particular, the historical data were already used to estimateM. Thus, in order to identify the parametric structure of d x , estimation procedures such as cross-validation should be employed.
To deal with the first challenge, in this paper we use the inverse Epanechnikov kernel evaluated at the last observation to characterize the parametric structure of d x . Inverse Epanechnikov kernel is a parsimonious approach to construct smooth functions. More specifically, let τ be the scaled index x/N with x ∈ (1, ..., N), the (adjusted) inverse Epanechnikov kernel evaluated at N is determined by The parameter b, ranging from 0 to 1, is the bandwidth of the inverse Epanechnikov kernel. For example, consider ages 0-100, the inverse Epanechnikov kernels evaluated at age 100 is displayed in Figure 1 for b =0.1, 0.25, 0.5, 0.75 and 1. We see that the estimated kernel 1 − K b (τ − 1) is constant for all ages up to a "cut-off" age, and then become decreasing afterwards. Specifically, the cut-off age is determined by the bandwidth parameter b. For example, if b = 0.1, then the cut-off is the 90% percentile of all ages considered. In other words, the kernel will be constant for the first 90% of ages. On other other hand, when b = 1, the cut-off age is the first age in the sample, and thus the kernel will be decreasing over the whole age group. Finally, the kernel will converge to 0.25, regardless the value of b. Therefore, the inverse Epanechnikov kernel provides a convenient tool to specify the smooth parametric structures of d x . Next, we can model d x as d , so that d x is positive and bounded by 1, and has a declining speed of convergence over age. All in all, the proposed parametric structure of d x requires only two additional parameters to be estimated for the CSVAR model, compared to the SVAR model: the hyperbolic parameter of the first age d 1 , and the bandwidth of the inversed Epanechnikov kernel b.
After reducing the number of parameters in the hyperbolic decay process, we now discuss the estimation of the two parameters (d 1 , b). Recall that d x only plays a role in the projection phase, and hence we can treat d 1 and b as tuning parameters, and estimate these two parameters using cross-validation (Feng et al. 2020). However, a usual cross-validation technique for time-series data, such as the expanding-window approach explained in Hyndman and Athanasopoulos (2018), does not apply to the proposed CSVAR model. The reason is that the expanding-window approach normally considers a short forecasting step, while the age-coherent feature proposed in the CSVAR model focuses the long-term forecast. Thus, we employ a hold-out-sample approach to select the tuning parameters. Formally, the following objective function is minimized: where RMSFE is the root of mean squared forecasting errors, and the evaluation period is given by the last fifth ([4T/5, T]) of the data in our study. 2 In summary, mortality forecasting with the proposed CSVAR model is generated with the following procedure. First, we derive the the optimal values of d 1 and b via a grid search based on the hold-out-sample cross validation. At this stage, the coefficients in the SVAR model,M andB, will be estimated using the ENET algorithm and the first 80% of data. Second, with the optimal values of d 1 and b, the SVAR coefficientsM and B will be estimated again using the full sample. The full-sample estimators will be used in the projection phase. Finally, after all parameters are estimated, the intercepts in the projection phase,m x,h s, are calculated, and future mortality rates are then projected using Equation (10). Besides the mean forecasts, a simulation strategy may be applied to investigate uncertainties in the projections. For instance, the prediction interval (PI) of the estimated life expectancies can be determined via simulation based on the normal distribution assumption of the residual ε t s. More specifically, we use method (I) described in Li (2014) to implement the simulation, where the 95% PI is composed of the 2.5th and 97.5th percentiles of the simulated results.

The Multi-Population Extension
We now consider a J-population case. Let y i,j,t be the mortality rate for age i of population j in year t. Further, y j,t be the N × 1 vector containing the mortality rates of population j in year t for j = 1, 2, ..., J, and y t be the JN × 1 vector containing the mortality rates of all populations at time t. The same procedure of estimating the single-population CSVAR model may be followed with two modifications. First,m x,h,j now follows the population-specific hyperbolic decay process: where the parameter d x,j is population dependent, andM * is the sample mean across all the J populations. Consequentially, each population has its own pair of parameters d 1,j and b j . Second, in the original SVAR model, the same penalty parameter λ is applied in the ENET algorithm and only one model is fitted. In this case the same setting used in the single-population CSVAR case can be applied. In particular, the CSVAR coefficients will be estimated from the following objective function: The tuning parameters λ, and (d 1,j , b j ), j = 1, 2, ..., J are then determined from minimizing the hold-out-sample RMSFE of mortality rates of the J populations: However, we may also consider a more flexible framework which allows for populationdependent penalty parameters λ j , j = 1, ..., J. More specifically, instead of estimating a universalB for all populations using the same penalty parameter λ, we may fit J pairs ofB j with the penalty parameter λ j for each population separately. In other words, the population-specific regression parameters (M j , B j ) are obtained by minimizing the objective function: After all the coefficient matrices are estimated, each pair of (λ j , d 1,j , b j ) will be determined to minimize the hold-out-sample RMSFE of mortality rates of population j only, i.e., Based on the estimation procedure above, it is rather efficient to generalize the agecoherent mortality projections to a large number of populations, since the parameters in each population-specific system are estimated separately and the number of parameters to be estimated will just increase linearly with the total number of observations in the data.

Empirical Analysis
This paper focuses on mortality data from the Human Mortality Database (2020). In particular, the proposed CSVAR model is illustrated using uni-sex, single-age mortality data of the United Kingdom (UK) and France from 1950 to 2016 and ages 0 to 100. The UK and French mortality data are chosen to illustrate the proposed CSVAR model because these two countries are both developed countries with similar socioeconomic conditions. The mortality of such countries are suitable to be modelled simultaneously as suggested by Li and Lee (2005). Further, the UK and French mortality data have been chosen to illustrate mortality forecasting models in the existing literature, including Chang and Shi (2020a) and Chang and Shi (2020b), among many others. The log central death rates are plotted in Figure 2 across all investigated years. Consistent improvements over time are observed for both countries. It can also be seen that there are relatively more pronounced mortality improvements of the oldest ages for the French population, when contrasting data of 1950 and 2016.

Long-Term Analysis
We first conduct the long-term analysis with the proposed CSVAR model using the entire sample period over 1950-2016. The projections generated by the Lee-Carter and the SVAR model are reported for comparison. First, Figure 3 displays the projected log mortality rates in 2100 generated by the three models, contrasted against the true rates in 2016. It is observed that, while projected mortality improvements are rather pronounced for the young and the middle ages, little improvements are gained at old ages for the forecasts produced by the LC and SVAR model, especially for the UK data. On the contrary, due to its age-coherent property, forecasts produced by the CSVAR model demonstrate much smoother mortality improvement across ages. In particular, the projected mortality improvements are more similar across ages-they are more pronounced for the old ages and less substantial at the younger ages than the other models. The more balanced projected mortality improvements are attributed to the decay inm x,h,j s. Next, we investigate the life expectancy projections, which is one of the most important applications in the mortality forecasting. In this paper, we take the life expectancy at birth (e 0 ) as an example, as it is a comprehensive measure which incorporates mortality information of all ages. As argued in Li et al. (2013), without a proper rotation ofb x in the Lee-Carter model, e 0 is likely to be underestimated, especially in the long run. Due to the lack of age coherence, this issue may also exist in the Lee-Carter and the SVAR model, as mortality improvements of the elderly could be underestimated. Figure 4 displays the mean forecasts and the prediction intervals of the life expectancy at birth (e 0 ) for both the UK and French population up to 2100. As for the point estimates, we can see that theê 0 s generated by the CSVAR model are uniformly larger than those of the Lee-Carter and the SVAR model. It is also worth noting that the SVAR model produces lower long-run forecast ofê 0 than the Lee-Carter model. More specifically, the point forecasts ofê 0 produced by the Lee-Carter, SVAR, and CSVAR model grow from 81.3, 81.2, and 81.2 (resp. 82.6, 82.6, and 82.6) as of 2017 to 90.6, 88.4 and 93.5 (resp. 93.7, 92.2, and 95.7) as of 2100 for the UK (resp. French) data, respectively. Overall, the higher projected life expectancies generated by the CSVAR model could be attributed to the more pronounced projected mortality improvement in the old ages, as shown in Figure 3. Additional conclusions can be drawn from the prediction intervals, which are generated based on the Gaussian-distributed temporal disturbances with 1000 simulated replicates. Despite the similarity of widths in early years, the lower bound of the 95% PI of the CSVAR model is higher than the point estimates of the Lee-Carter and SVAR (resp. SVAR) model for the UK (resp. French) data as of 2100, implying a significant difference. Furthermore, the widths of PIs produced by the SVAR model are wider than those of the Lee-Carter model after around 20-30 years, whereas the CSVAR model leads to the narrowest PIs for both populations. More specifically, for the UK (resp. French) data, the widths of 95% PIs of the Lee-Carter, SVAR and CSVAR model as of 2100 are 6.6, 8.0 and 3.9 (resp. 7.5, 10.1 and 5.7) years, respectively.

Out-of-Sample Forecast
Next, we consider the out-of-sample forecasting performance of the proposed CSVAR model. In particular, we compare its forecasting accuracy with those of the Lee-Cater model and the SVAR model. Following Li and Lu (2017) and Feng et al. (2020), the training sample is set to 1950-2000, and the remaining data is used as the test sample. The selected tuning parameters of the SVAR and CSVAR models are reported in Appendix A. For each model, we consider the RMSFEs over age groups and at individual time horizons separately and an overall measure as follows: RMSFE x (resp. RMSFE h ) is the RMSFE averaged over all 16 forecasting steps for age group x (resp. all the 101 ages at forecasting horizon h), and RMSFE all,h is the overall measure accounting for all ages and forecasting horizons up to h.
Firstly, RMSFE all,h and descriptive statistics of RMSFE x are reported in Table 1, where bold numbers indicate the smallest quantity for each statistic. The mean RMSFE x across all age groups of the CSVAR model is around 44% and 45% (resp. 5% and 6%) smaller than that resulting from the Lee-Carter model (resp. SVAR model) for UK and France, respectively. Moreover, Q 1 and Q 3 (the first and third quartiles) support that CSVAR model performs reasonably well among the three competing models, and standard deviation of RMSFE x confirms that the RMSFE x s of the CSVAR model are more narrowly spread than those of the Lee-Carter and the SVAR models. Finally, as indicated by RMSE all,16 , the overall performance of the proposed CSVAR model is better than the Lee-Carter and the SVAR model for both populations. Secondly, Figure 5 plots the RMSFE h at individual forecasting horizons ranging from 1 year (2001) to 16 years (2016). Distinct differences among all the three models can be observed, especially at larger horizons. In general, the VAR models consistently outperform the Lee-Carter model in all horizons for both populations. Furthermore, comparing with the SVAR model, RMSFE h of the CSVAR model is smaller in the majority of cases. Finally, with the growth of h (especially from the 8th step onward), the increment in RMSFE h is slower for the CSVAR model than Lee-Carter and the SVAR model, suggesting its better performance in the long run.

Robustness Analysis
To evaluate the robustness of the forecasting results, we perform the out-of-sample forecasting analysis under three major variant settings. Firstly, we follow Li et al. (2013) and Boonen and Li (2017) and model the logged death rates of the five-year ages instead of the single-year groups. Secondly, we consider a shorter training sample of 1970-2000. Finally, instead of crude death rates, we consider the smoothed rates using P-splines (Eilers and Marx 1996), as employed in Hyndman and Ullah (2007). We report the resulting RMSFE all,16 s in Table 2. It can be seen that the CSVAR model improves the forecasting results of the LC (resp. SVAR) model by at least 30% and 45% (resp. 7% and 8%) for the UK and French data, respectively. Therefore, we conclude that the proposed CSVAR model is able to produce more satisfying forecasting performance compared to the LC and SVAR model under different settings.

The Two-Population Extension
We now investigate the multi-population case by modelling the UK and French data jointly. As discussed at the end of Section 3.2, we consider the situations of a uniform penalty term (CSVAR 1 ) and individual penalty terms (CSVAR 2 ). The selected tuning parameters of the SVAR and the two CSVAR models are reported in Appendix A. The results of RMSFE x are summarised in Table 3, where the Li-Lee model is fitted to replace the Lee-Carter model for comparison. The following observations can be made. First, by jointly modelling the UK and French data, the forecasting results of the Li-Lee model are considerably improved over those of the Lee-Carter model displayed in Table 1. This, however, is not the case for SVAR and CSVAR 1 . Although their performances are comparatively better than those of the Li-Lee model for both populations, the forecasting results of the CSVAR 1 model and the SVAR model are less accurate than those of the single-population counterparts. For instance, the RMSFE all,16 of the CSVAR 1 model increase from 0.1106 and 0.1358 to 0.1159 and 0.1396 for UK and French data, respectively. One possible reason of the reduced accuracy in the two-population case is that the VAR model may produce less accurate forecasts when more irrelevant information is included (Feng and Shi 2018). Therefore, the application of a uniform penalty term may lead to the estimatedM andB that poorly reflect the historical mortality pattern of both populations. In contrast, when different penalty terms are allowed, the forecasting results of CSVAR 2 are almost uniformly better than those of the Li-Lee, SVAR, and CSVAR 1 model. Furthermore, compared to the single-population forecasts, CSVAR 2 leads to 23% improvement over the single-population CSVAR model of the French data (RMSFE all,16 of 0.1046 vs 0.1358). As for the UK data, the forecasts of CSVAR 2 are almost identical to those of the single-population CSVAR model. This may suggest that the UK mortality improvements may provide additional important information in the projection of the French improvements, but not vice versa. We now compare the out-of-sample forecasting accuracy of the CSVAR model, the SVAR model, and the Lee-Carter model in terms of life expectancy projection. Again, the life expectancy at birth is used as an illustration. We fit the three models using data from 1950 to 2000, and use the projected log mortality rates to produceê 0 over 2001-2016. The resulting mean forecasts and the 95% prediction intervals are then plotted against the true life expectancies in Figure 6. It can be seen that theê 0 s produced by the CSVAR model are uniformly larger than those generated by the Lee-Carter and the SVAR model, especially for the UK data. This is consistent with the age-coherent property of the CSVAR model. More importantly, those forecastê 0 by the CSVAR model are closest to the true values at all horizons for both populations. As for the PIs, all models manage to cover the range of the true e 0 . In terms of the efficiency, however, the VAR-based models generate narrower PIs than the Lee-Carter model for both populations. In particular, the CSVAR model results in the most efficient interval estimates, with the widths of the UK data almost 50% narrower than those of SVAR.

Conclusions
This paper proposes an age-coherent sparse vector autoregressive model to forecast log mortality rates. In particular, we allow the age-specific mortality improvement rates to converge to a universal long-term mean for all ages. The following key results can be drawn from our study. First, the proposed coherent VAR model generates more accurate out-of-sample forecasting results than the Lee-Carter model and the sparse VAR (SVAR) model recently developed by Guibert et al. (2019), as measured by the root-mean-square errors. This result holds for both the uni-sex, single-age UK and the French mortality data for age 0 to 100 with the training sample of 1950 to 2000 and the forecasting period of 2001 to 2016, and is robust when the training sample is shortened to 1970 to 2000 or when five-age mortality data are used instead of the single-age mortality rates. Second, the proposed coherent model retains the attractive advantages of the SVAR model, and has a more flexible parametric structure than the factor-based models, such as the Lee-Carter and the Li-Lee model. In particular, by extending the SVAR model, the proposed model uses the lasso-based algorithm to determine the autoregressive coefficient matrix, and thus the autoregressive matrix is data-driven, rather than based on a prior parameter constraints. Consequentially, the VAR model is able to capture rather general patterns of mortality developments, such as the impact of mortality change of a young age on that of a very old age. Thirdly, by allowing the mortality improvement rates to converge to a universal long-term mean, the proposed model can generate coherent long-term mortality projections, i.e., projected mortality rates at different ages will not diverge in the long-run, such as the spatial-temporal model by Li and Lu (2017). Moreover, by utilizing a hyperbolic decay structure, we allow the speed of convergence to vary with age. Finally, since the number of parameters to be estimated only increase rather moderately with the addition of new populations, the proposed model can be applied to model the joint mortality improvements of multiple populations in an efficient manner. In the multi-population case, projected mortality rates are coherent both on the age and the population dimension. A two-population illustration is made using the UK and French data.
In this paper, we focus on the age-coherent extension of the SVAR model developed by Guibert et al. (2019). However, the proposed age-coherent extension is rather general and applicable to more VAR specifications, such as VAR models with a moving average structure or stochastic volatility. In the future, it would be interesting to explore the age-coherent mortality projections based on more general VAR specifications. Acknowledgments: The authors are grateful to the University of Manitoba and Macquarie University for their support. We thank the three anonymous referees for their valuable comments on the earlier version of this paper. The usual disclaimer applies.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations and variables are used in this manuscript: The average mortality level at each age x k t The mortality index at time t b x The age-specific sensitivity of y x,t to changes in k t ε x,t The normal error term Y t The vector of differenced log central mortality rate M Intercept vector of the VAR-type models B Coefficients of Y t−1 in the VAR-type modelŝ m x,h Forecast intercept term in the CSVAR model for age x at step h The hyperbolic parameter associated withm x,h λ The ENET penalty Additional variables of joint population models: B x Age effect of the common factor K t Period effect of the common factor