A Two-Population Extension of the Exponential Smoothing State Space Model with a Smoothing Penalisation Scheme

The joint modelling of mortality rates for multiple populations has gained increasing popularity in areas such as government planning and insurance pricing. Sub-groups of a population often preserve similar mortality features with short-term deviations from the common trend. Recent studies indicate that the exponential smoothing state space (ETS) model can produce outstanding prediction performance, while it fails to guarantee the consistency across neighbouring ages. Apart from that, single-population models such as the famous Lee-Carter (LC) may produce divergent forecasts between different populations in the long run and thus lack the property of the so-called coherence. This study extends the original ETS model to a two-population version (2-ETS) and imposes a smoothing penalisation scheme to reduce inconsistency of forecasts across adjacent ages. The exponential smoothing parameters in the 2-ETS model are fitted by a Fourier functional form to reduce dimensionality and thus improve estimation efficiency. We evaluate the performance of the proposed model via an empirical study using Australian female and male population data. Our results demonstrate the superiority of the 2-ETS model over the LC and ETS as well as two multi-population methods - the augmented common factor model (LL) and coherent functional data model (CFDM) regarding forecast accuracy and coherence.


Introduction
Continual improvements in human life expectancies over the past few decades have brought a serious challenge to the prediction of future mortality scenarios. Mortality forecasts are crucial not only in demography but also in many other relevant areas. Accurate forecasts are therefore essential to government planning, designing of pension schemes and annuity products and the reserving for insurance companies.
Actuaries and researchers have developed various models to describe and predict features of mortality reductions. One of the most famous models is the Lee-Carter (LC) (Lee and Carter 1992) model belonging to the extrapolative family whose members produce predictions by assuming the continuity of past patterns. Many developments and extensions have been proposed to the single-population LC model. For example, Renshaw and Haberman (2006) incorporate an additional cohort factor to capture the pattern related to the year of birth. Li and Lee (2005) develop a multi-population version of the LC which is referred to as the augmented common factor model (LL).
Although the LC model receives criticisms for its insufficient allowance for potential volatility in mortality forecasts (see, for example, Wong et al. 2020), it has been regarded as a benchmark in various studies. For instance, Feng and Shi (2018) adopt the exponential smoothing state space (ETS) model 1 to predict mortality rates and compare its performance with those under the LC, functional data model (FDM) as well as some univariate time series processes. Thereinto, the ETS model turns out to be the best-performing choice based on Australian population data. According to Makridakis and Hibon (2000), the ETS model also presents outstanding results in the M3-competition. However, fitting a single-population ETS model without constraints/penalties may be incapable of ensuring the coherence, which is important in long-run forecast of mortality rates (Li and Lu 2017;Li 2013;Li and Lee 2005). As indicated by our empirical studies, the mortality forecasts of the single-population ETS model suffer from the limitation that rates of adjacent ages may be inconsistent with one another in the long run. In other words, it is possible to generate significant fluctuations for certain age groups, which can cause problems when using such forecasts to price annuities and mortality-linked securities. Furthermore, in the case of modelling multiple populations, single-population models such as the LC and ETS cannot ensure consistency between populations, and hence lose the critical property of coherence. It would be more desirable to perform a joint modelling of two or more related groups and integrate their relationships into mortality forecasts. For example, it is biologically unreasonable to predict that future mortality rates of males and females in the same country will diverge over time.
Our study overcomes the above issues of the original ETS model by imposing a smoothing penalisation scheme as described in Li and Lu (2017) and extending it to a two-population ETS model . Under the proposed model, the rates of mortality changes for sub-populations under investigation are associated with each other, enabling coherent forecasts for the whole group. More specifically, the smoothness across adjacent ages is guaranteed by setting parameters which minimise the sum of squared differences of mortality changes between neighbouring ages. However, the 2-ETS model involves hundreds of parameters and is difficult to estimate because no close-form solutions are available from its iterative identification procedure. To improve the fitting efficiency, we employ the Fourier dimensionality reduction technique. In particular, a Fourier functional form is fitted to each of the exponential smoothing parameters in the 2-ETS model, so that the original group of unknown parameters is reduced to a dozen of Fourier factors.
To examine the performance of the 2-ETS model, we compare its prediction results with those under the benchmark LC model and the original ETS model. Besides these two single-population candidates, the multi-population extensions of LC and FDM -the LL and coherent functional data model (CFDM) developed by Hyndman et al. (2012) are added to the comparison list. Using Australian female and male population data over 1950-2016 and ages 0-100, we demonstrate the superiority of the proposed 2-ETS model over the other candidates under various scenarios. Based on simulated replicates with multi-Gaussian distributed residuals, the prediction intervals (PIs) also accurately capture the true data, when mortality rates averaged over all ages are used.
In summary, this paper develops a two-population ETS model with a smoothing penalisation scheme and compares its performance with other popular alternatives. The proposed model ensures the desirable coherence property and can improve the superior forecasting results of the original single-population ETS model. The remaining of the article is structured as follows. Section 2 reviews specifications of the LC, ETS, LL and CFDM models. Section 3 specifies the 2-ETS model and describes the fitting procedure. An empirical study comparing the five mortality models is reported in Section 4. Finally, Section 5 gives concluding remarks and possible directions for future research. 1 See Hyndman et al. (2002) for a thorough review of exponential smoothing methods.

The Lee-Carter Model
The Lee-Carter (LC) model is proposed by Lee and Carter (1992). It expresses the log central mortality rate at age x in year t as where a x is the average mortality level at each age, k t is the mortality index at time t, b x represents the age-specific sensitivity of ln m x,t to changes in k t , and ε x,t is the error term with null mean. Since the right-hand side parameters are not observable, they are estimated by singular value decomposition (SVD) instead of the usual ordinary least square approach in the original paper. 2 To avoid the identification problem, two constraints ∑ t k t = 0 and ∑ x b x = 1 are imposed. As implied by the first constraint, the age effect a x is set to the mean of log central death rates across years. Given the estimated a x and b x , k t is adjusted to match the fitted total number of deaths to the observed values in each year t. The reconciliation rebalances the equal contribution by mortality at all ages by assigning greater weights to ages at which death counts are larger. Under the LC model, the two age-specific parameters are assumed to remain unchanged over time, and the mortality index is often modelled by a random walk with drift as follows: where the drift term d measures the average annual change in k t , and e t ∼ N(0, σ 2 e ). As suggested by Giacometti et al. (2012), the expected h-step-ahead forecast of the mortality index and the log central death rate can be calculated as:k where T is the end of the fitting period.

Exponential Smoothing State Space (ETS) Model
One popular category of forecasting models is called exponential smoothing model under which forecasts are produced as a weighted sum of past values. Members of this family assign exponentially decaying weights to observations further into the past rather than using a simple average . Pegels (1969) proposes a way to classify ETS models according to the combination of various types of error, trend and seasonal components involved in the model. This list has been extended to thirty distinct ETS models by employing additive/multiplicative error/trend/seasonality components. Thereinto, a 'damped' type can be added to characteristics of the trend component, implying a flattened trend of predictions (Gardner and Mckenzie 1985). For instance,  introduces an ETS model with an additive damped trend, which is then modified by Taylor (2003) to a multiplicative one. Besides, it has been shown that exponential smoothing models can be expressed as innovations state space models (Hyndman et al. 2002(Hyndman et al. , 2005. Detailed model specifications can be found in Section 2 of Hyndman and Khandakar (2008).
Nevertheless, ETS models with seasonal components are not applicable to our study because seasonality is not present in mortality forecasting. In addition, Feng and Shi (2018) suggest that only 2 A maximum likelihood method may also be employed to calibrate the parameters (Renshaw and Haberman 2003). two ETS models (with additive (damped) trend and additive error terms) are possibly suitable for modelling mortality rates. We do not consider the ETS model with damped trend in this paper. 3 Expression of the only appropriate ETS specification (also known as the Holt-Winters model) is described as follows: where l x,t and b x,t represent the level and growth of ln m x,t , respectively. Their corresponding exponential smoothing parameters α x and β x can be computed by minimising ∑ x,t ε 2 x,t , but no close-form solutions are available from the iterative estimation procedure. The h-step-ahead forecast of the log mortality rate is lnm x,T+h = l x,T + hb x,T , where T is the end of the fitting period. When modelling mortality of multiple populations, the above two single-population models may fail to ensure coherence. For example, separate forecasts for female and male mortality generated from single-population models may diverge over time. A more formal discussion of the coherence can be found in Section 3.1. To ensure this desirable feature, we also consider two popular multi-population models.

The Augmented Common Factor, or Lee-Li (LL) Model
Li and Lee (2005) extend the Lee-Carter model by introducing an additional common factor which controls the relationships between populations. Specifically, the log central death rate is modelled as: where a x,i represents the average of the age-specific mortality level for the ith population, B x and K t represent the age effect and period effect of the common factor, k t,i is the time component of the ith population with age response b x,i , and ε x,t,i is the population-specific error term. The common factor B x K t describes the mortality trend of all populations. In the original work of Li and Lee (2005), it is estimated from applying the LC method to the total population, subject to constraints ∑ t K t = 0 and ∑ x B x = 1. Then a x,i is obtained by minimising the modelling error of each subpopulation ∑ t (ln m x,t,i − a x,i − B x K t ) 2 at age x. Implied by the constraint on K t , a x,i is taken as the average of ln m x,t,i over t. The population-specific factor b x,i k t,i can be estimated by applying SVD to the residual matrix (ln m x,t,i − a x,i − B x K t ).
Similar to the case under LC, the common mortality index K t can be modelled as a random walk with drift process. On the other hand, the group-specific time component k t,i is fitted by a stationary autoregressive process to ensure coherent forecasts in the long term. Specifically, where α 0,i and α 1,i are the autoregressive parameters and e t,i is the Gaussian error term with null mean. The stationarity guarantees that deviations of each population from the common trend will not 3 In our preliminary analysis, all damped parameters essentially approach 1 after a penalised structure is considered as in (14).
continue in the long run. Given the data observed in the last year T, the h-step-ahead forecast of the log central death rate is given as follows: Hyndman et al. (2012) propose a mortality model with coherent forecasting, which is developed from the single-population functional data model (Hyndman and Shahid Ullah 2007). Instead of working on mortality rates directly, the coherent functional data model (CFDM) predicts the product and ratio functions of mortality rates for different groups. Considering the case with I populations, the product and ratio functions are given as where m x,t,i is the central death rate of population i (i = 1, 2, . . . , I). Therefore, the CFDM is also referred to as the product-ratio model, which can be expressed as: where µ x,p and µ x,r,i are the average of ln p x,t and ln r x,t,i across years, ε x,t and ε x,t,i are serially uncorrelated error terms with zero mean, and the principal factors β x,j , γ x,g,i and their corresponding component scores φ t,j , ψ t,g,i are obtained using the weighted principal components analysis (Hyndman and Shang 2009). This fitting technique assigns higher weights to more recent data, which avoids the problem of potential time-varying age components (Lee and Miller 2001).
Both the number of principal factors for product and ratio functions are set to be 6 (J = G = 6) which is the optimal choice balancing forecast accuracy and parameter parsimony (Hyndman et al. 2012). Those time-varying components of the product function govern the main trend of future mortality rates and are forecasted by non-stationary processes. Nonetheless, stationarity is required in modelling the period effects for the ratio function to ensure the non-divergence of mortality projections. The h-step-ahead forecast of log central death rates for each subpopulation can be calculated as where T is the end of the fitting period, µ x,i = µ x,p + µ x,r,i . While the prediction function is similar to that under the LL model, the CFDM model adopts six components for the common and population-specific factors rather than one.

The Two-Population ETS Model
Compared with a single-population model, the most outstanding merit of a multi-population model is the characteristic of coherence, which is defined as follows (Li and Lee 2005).

Definition 1.
Coherence means that the forecasts of ln m x,t,i and ln m x,t,j will not diverge for the mortality rate of the x-year-old of populations i and j, when t → ∞. Li and Lee (2005) and Hyndman et al. (2012), respectively, the forecasts produced by LL and CFDM models are coherent.

Remark 1. As argued in
Despite the outstanding forecasting performance of the ETS model presented in Feng and Shi (2018), the original ETS model is not feasible for multi-population modelling. In this section, we propose a two-population ETS model and demonstrate the existence of coherence in this framework.

Model Specification
In the original ETS model, it is worth noting from (5) that when h is large (indicating long-term forecasts), lnm x,T+h will be dominated by b x,T+h . It is because l x,T is not changing with h and is therefore o(h). Furthermore, the growth equation of (4) indicates that which is a random walk without drift and thus an I(1) process. Therefore, within a multivariate (vectorized) framework, we will adopt the idea of co-integration. A related structure can be found in Li and Lu (2017), for which a two-population ETS (2-ETS) model can be specified as follows.
where β * x,i = β x,i α x,i , i = 1, 2, and −i =1 (2) when i =2 (1). The forecasting equations under the 2-ETS model are more complex than those produced in (4), which can be iteratively derived using Theorem 1. Given that all α x,i , β x,i and γ x,i fall in (0,1) for all ages x and i = 1, 2, and ε x,t,i follows a multi-Gaussian distribution with means 0 and covariance matrix Σ i for each i = 1, 2, mortality rates forecasted by the 2-ETS model described in (12) are coherent.
Consequently, using (13) we have that when t → ∞. Thus, the ratiom x,T+h,1 /m x,T+h,2 will converge to a constant at each age and the death ratesm x,T+h,1 andm x,T+h,2 will not diverge in the long run, which completes the proof.
Remark 2. The assumptions of the 2-ETS model are all standard and not strong. For example, α x,i , β x,i ∈ (0, 1) is directly adopted from the single-population ETS model. γ x,i ∈ (0, 1) is an analogous extension.
The assumption of multi-Gaussian disturbances is popularly employed in the existing literature, such as Lee and Carter (1992), Hyndman et al. (2012) and Li and Lu (2017).
In addition to the coherence among populations, smoothness across neighboring age groups is also of interest in mortality forecasting. Thus, in terms of the estimation, we follow the smoothing penalisation scheme of Li and Lu (2017) where age groups range from 0 to 100, and λ 1 and λ 2 are the known non-negative tuning parameters for populations 1 and 2, respectively. If both λ's are 0, the estimation reduces to an unpenalised optimisation problem. The larger the λ's are, the smoother the resulting forecasts will be.

Reduction of Dimensionality
Despite the desirable coherence and smoothness, the 2-ETS model described above is difficult to calibrate. To see this, the equation of each age has six free parameters (α x,i , β x,i and γ x,i , for i = 1, 2). The total number of free parameters can be over six hundred, with age groups of 0-100. As no close-form solution is available, the estimation efficiency may be questionable without using a dimensionality reduction technique.
As argued in Li and Lu (2017), the fitted coefficients of all parameters should change smoothly for adjacent ages. To see this, for the smoothed Australian females and males mortality rates, we firstly fit an unpenalised 2-ETS model. The included ages are from 0 to 100, and the sample period is 1950-2006. The resultingα x,i ,β x,i andγ x,i are plotted in Figure 1 (for females) and Figure 2 (for males) as scatter dots. 4 In contrast to Li and Lu (2017), we do not penalise α x,i , β x,i and γ x,i . One reason is that those parameters will be smoothed after applying the procedure described in Section 3.2. The other reason is that out-of-sample forecasts of ln m x,t,i do not directly depend on them. In other words, smoothed parameters will not necessarily enforce the smoothness of b x,T,i across x. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q For both females and males, consistent with Li and Lu (2017), all the fitted parameters demonstrate certain smoothed patterns between neighbouring age groups. Thus, the dimensionality can be largely reduced, if we assume thatα x,i ,β x,i andγ x,i follow some simple parametric smoothed functions of the age x. A possibility is to adopt an Fourier flexible functional form as follows: where the subscript refers to the parameter concerned and n α i , n β i and n γ i determine the smoothness of each parameter. The smaller they are, the smoother the variations of those parameters across adjacent age groups will be. To select an optimal number, one needs to balance the parsimony and accuracy. However, it is worth noting that a high-level accuracy (precisely match the structures of the raw estimates) is not desirable. For one thing, the raw estimates are obtained before applying the penalty scheme. Hence, according to Li and Lu (2017), given the limited data availability, estimates of an unpenalized model is of a more random nature. Upon the implementation of a penalty scheme, those patterns as shown by the scatter dots in Figures 1 and 2 are expected to change and to be smoother (simpler). For another, as shown in (15), for larger n α i , n β i and n γ i , the corresponding models nest those of smaller numbers of trigonometric pairs. Consequently, if a 2-ETS model with simpler parametric structures can produce satisfactory forecasting results, those with larger n α i , n β i and n γ i are at least not expected to underperform the nested model. Based on the above rationales, we select n α i , n β i and n γ i as the smallest integers, such that the R 2 of the corresponding linear regression is over 50%.
The fitted results are also demonstrated in Figures 1 and 2 as solid lines, which overall well represent the general structures of α x,i , β x,i and γ x,i . The optimal n α 1 , n β 1 , n γ 1 , n α 2 , n β 2 and n γ 2 are 2, 3, 6, 3, 4 and 5, respectively. Thus, the total number of free parameters can be reduced from 606 to 52, which is over 90% smaller. More specifically, instead of estimating α x,i , β x,i and γ x,i directly, given predetermined n α i , n β i and n γ i , we can estimate the intercepts and slopes included in (15) to obtainα x,i , β x,i andγ x,i which then minimise Equation (14). The reduction of dimensionality is critical to tunning parameter selection, for which the procedure is computational intensive with the optimisation being performed repeatedly.

Selection of the Tuning Parameter
To select the tuning parameters λ 1 and λ 2 , we employ the procedure discussed in Hyndman and Athanasopoulos (2018) to perform the cross-validation for time series, which is also known as 'evaluation on a rolling forecasting origin.' The basic algorithm is explained below: 1. Identify the first training set (e.g., ln m x,1,i ,ln m x,2,i ,. . . ,ln m x,0.75T,i ) out of the the entire sample; 2. Given λ 1 and λ 2 , use the training set to fit the 2-ETS model and obtain the 1-step-ahead forecast lnm x,0.75T+1,i ; 3. Extend the training set to include ln m x,0.75T+1,i and refit the 2-ETS model to obtain the 1-step-ahead forecast lnm x,0.75T+2,i ; 4. Repeat steps 2-3 until lnm x,T,i is generated; and 5. Calculate the root of mean squared error (RMSE) as λ 1 and λ 2 are then chosen as those resulting in the smallest RMSE.

Overall Fitting Procedure
Now we consider the entire fitting process, by combining the procedures of dimensionality reduction and tuning parameter selection. The overall fitting procedure of the 2-ETS model is explained below: 1. Fit an unpenalised 2-ETS model to obtainα x,i ,β x,i andγ x,i ; 2. Select n α i , n β i and n γ i as described in Section 3.2; 3. Given the chosen n α i , n β i and n γ i , select the tuning parameters λ 1 and λ 2 as described in Section 3.3; and 4. Use the selected n's and λ's with (15) to minimise (14).
Forecasts of mortality rates can then be produced using the model as fitted above. The associated prediction intervals (PIs), can be produced via simulations based on the multi-Gaussian errors. The Σ i can be computed as the sample covariances ofε x,t,i given the obtained estimates of parameters.

Empirical Analysis
We have collected mortality data of Australian female and male populations aged 0-100 between 1950 and 2016 from the Human Mortality Database (2020). The starting year is chosen as that investigated in Booth et al. (2006) and Hyndman et al. (2012) to obtain a complete and relevant dataset. Figure 3 displays the age-specific log death rates over the sample period. It can be seen that Australian Females and males both exhibit continual mortality improvements, while some distinctions exist. For example, the decrease of male death rates at around age 20 (accident hump) has been more rapid than that for females in recent years. Multi-population models may be able to capture those similarities and differences between the two populations. We compare the forecasting performance between the LC, ETS, 2-ETS, LL, and CFDM models using the 10-step-ahead projection, with a training set of 1950-2006. Then, the predictions are compared against observed (true) values to assess forecast accuracy. Female and male data are modelled separately (jointly) under the single-population (multi-population) models.

Forecast Accuracy Comparison
The forecast accuracy of the mortality models is examined by the RMSE at age x, forecasting step h and as a total measure across age groups and time horizons as follows.
where RMSE x,i (RMSE h,i ) is the root mean squared error at age x (forecasting step h) across 10 prediction steps (101 ages) for population i, RMSE all,h,i is a two-dimensional criterion measuring forecast error over all age groups and time horizons up to h. Figure 4 plots the RMSE x,i against age. A summary of RMSE values computed across ages is presented in Table 1. As indicated, the LC model tends to produce the least accuracy at most ages for both genders, whereas no single model uniformly beats the rest. More specifically, all the models except 2-ETS show some peaks (abnormally large RMSE values) at age groups of around 20 for female population, and the forecast error at around age 12 under all the five candidates present a significant peak. For males, besides the unusually large RMSE at age 20, LC and CFDM exhibit a peak at age 60. In general, the two single-population models and CFDM tend to produce large RMSEs at certain ages. The curves of LL and 2-ETS show similar shapes, whereas our 2-ETS model clearly outperforms all the other competing models over ages 15-30.
One advantage of 2-ETS is that it does not produce abnormally large RMSE, which is shown by its standard deviation in Table 1, being the smallest among all the models. The first column in Table 1 gives the overall measure of the forecast accuracy. It is interesting to see that the three multi-population models outperform the two single-population models for both genders (except under CFDM for males). More specifically, the best-performing model is 2-ETS, followed by LL, and LC tends to produce the least accurate predictions. The results of ETS and CFDM are fairly close to each other, though CFDM (ETS) tends to predict female (male) population more accurately. The above relationships also hold for RMSEs averaged over age groups. In general, all the statistics except the first quartile Q 1 advocate the newly proposed 2-ETS model for both genders. The superiority of the 2-ETS over the rest is more obvious for males.  Figure 4. RMSE x,i plotted against age groups for Australian mortality data. Note: RMSE all,10,i is the overall measure across all ages and forecasting steps for population i. The columns Mean, Std. Dev., Q 1 and Q 3 display the sample mean, standard deviation, first and third quartiles of RMSE x,i calculated over age groups, respectively. The minimum value of each statistic among the five models is presented in bold.
We now consider the prediction results over time horizons. The two-dimensional measure RMSE all,h,i against forecasting horizon h is plotted in Figure 5. Among the five candidates, LC is the worst-performing model with notably the highest forecast errors, and the differences become even more evident for Australian males. Unlike the earlier observations, the multi-population models do not consistently beat the single-population ETS. For instance, the LL curve lies above the ETS curve before a crossover at around step 5 for the two populations. Nevertheless, our 2-ETS almost consistently outperforms the other competing models, especially for males. The individual RMSE h,i values at each forecasting step are summarised in Table 2. Consistent with our observations in Figure 5, the 2-ETS model produces the smallest RMSE consistently for males and leads to the 6 out of 10 minimum RMSE h,i for females.   It is worth investigating the desirable smoothness of the 2-ETS model with the empirical data. Figure 6 plots the projected and observed mortality rates for Australian females and males in 2016. The results of the single-population models are given in the top panel. The ETS curve shows more irregularities over neighbouring ages for both genders. In comparison, the predicted values under the 2-ETS model are not only much more smoothed out over neighbouring ages but also closer to the observed values. Furthermore, the LC model tends to over-estimate (under-estimate) the mortality rates for females aged 20-30 (30-60) and for males aged 20-40 (5-15 and 40-60). The multi-population models (bottom panel) seem to produce similar levels of forecasts and tend to outperform the two single-population candidates. Among the three multi-population models, 2-ETS clearly beats LL and CFDM over age range 15-30, whereas performances of the three are similar for the older populations. Overall, it can be concluded that the proposed 2-ETS model predicts the Australian mortality rates in 2016 reasonably well. The smoothness over adjacent ages is also observed.

Prediction Intervals via Simulation
We now evaluate the interval forecasts of the 2-ETS model via simulation, as briefly discussed in the end of Section 3.4. The simulation procedure is summarised as follows.
1. Given the in-sample period 1950-2006, we estimate the model parameters and calculate the fitted (log) central death rates lnm x,t,i ; 2. The 57 × 101 residuals are then collected asε x,t,i = ln m x,t,i − lnm x,t,i , which are assumed to follow a multi-Gaussian distribution with means 0 and covariance computed as sample values from usingε x,t,i ; 3. Given the assumed distribution, simulate a 10 × 101 matrix of error terms, which is applied to the 2-ETS projections from 2007 to 2016, according to (12); and 4. The process is repeated until 5000 replicates are produced. To sum up, with a 10-year out-of-sample period, we demonstrated the outperformance of the proposed 2-ETS model over the existing models. Its smoothness is also present in the scenario of h = 10 (2016). In the next section, we further explore the coherence and smoothness of the 2-ETS from a long-term forecasting perspective, and compare its performance with the other four competing models.

Long-Term Forecasting Performance
To investigate the long-term performance of the five candidates, we obtained projections up to 2050 based on the full sample . The results are plotted in Figure 8. The curves of the two single-population models exhibit some deviations from those of the multi-population counterparts. Firstly, under the LC model, there is a significant accident hump in 2050 for female population only. Forecast curves of the other models do not have such a deep hump. Furthermore, female mortality improvements forecasted by the LC tend to be smaller than those produced by the multi-population models over ages 30-60. This is less evident when males data are analysed. When the ETS model is adopted, as expected and being consistent with Figure 6, significant irregularities over neighbouring ages are evident for both genders. Such irregularities are not observed in the case of 2-ETS model, indicating its improved smoothness across ages. Among the three multi-population candidates, CFDM tends to produce the lowest (highest) rates for the youngest (oldest) 15-year age group for both genders. The 2-ETS curve lies above the other two over age range 40-80. Apart from that, some sex-specific differences are also present. For example, the predicted mortality rates under LL for Australian females aged 5-15 are much lower than those of 2-ETS and CFDM.
Following Li (2013), we examine the coherence of mortality forecasts between sexes by plotting the male-to-female ratios from 1990 to 2050. The observed (predicted) mortality rates are averaged over each of the three age groups: 0-29, 30-59, and 60-100, then the mean values of the male population are divided by those of the female population to obtain the corresponding ratios. As indicated in Figure 9, the three multi-population models produce convergent ratios in the long run for all age intervals, which is not the case when single-population models are applied. For instance, the male-to-female ratios of the youngest group under the LC model and the middle age group under the ETS model show a decreasing trend, which potentially causes the crossover problem of mortality forecasts between genders.
In conclusion, without considering coherence between populations and smoothness across ages, single-population models would perform differently from multi-population models in the long-run. In particular, the single-population ETS model produces undesirable divergent mortality forecasts, which can be largely avoided when the 2-ETS model is employed. Considering the results discussed in Sections 4.1 and 4.2, we can conclude that the 2-ETS model is the best performing model which also effectively achieves coherence and smoothness, when the Australian female and male mortality data are examined.

Conclusions
This research proposes a 2-ETS model with smoothing penalisation scheme and demonstrates its coherence property in mortality forecasting. Using an effective dimensionality reduction technique, we evaluate the out-of-sample forecasting accuracy of 2-ETS based on the Australian female and male mortality data. Two single-population models LC and ETS, and two multi-population models LL and CFDM are also tested and compared with the proposed candidate. Our analysis demonstrates that the 2-ETS model tends to produce less large forecast errors at different age groups (measured by RMSEs) when compared to the other candidates. For different forecasting horizons, the 2-ETS model almost consistently leads to smaller forecast errors than the others, especially for Australian males. The superiority of our proposed model is further demonstrated by the overall accuracy measure considering both age and time dimensions. We then construct the associated PIs via a simulation study based on the multivariate Gaussian assumption of error terms. In general, the multi-population models tend to outperform the single-population candidates regarding prediction accuracy. Although the original ETS model produces satisfactory RMSEs, it suffers from a shortcoming of fluctuating forecasts across adjacent ages and divergent forecasts between genders. From the 10-step-ahead and long-term projections, we can observe that the proposed 2-ETS model overcomes the above problems. Mortality forecasts under the new model are coherent between males and females in the long run and are smoothed over neighbouring ages.
There are several directions for future study. Firstly, the 2-ETS model may be extended to cater for co-modelling of three or more sub-populations of a group in practice. For example, the joint projection of state-level data would be useful for government planning such as social benefits and superannuations. Secondly, the model may be applied or modified to investigate the evolution of age patterns in mortality data by fixing the time effect and forecast in the age dimension. Moreover, the ETS specification does not consider mortality improvements linked to the year of birth. Either a common or population-specific cohort factor may be added to the model structure, but further research is needed. Other approaches to identify parameter estimates and to reduce the dimensionality may also be performed in future research.
Author Contributions: Methodology, Y.S. and J.L.; formal analysis, Y.S. and S.T.; writing-original draft preparation, Y.S. and S.T.; writing-review and editing, J.L.; visualization, S.T. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.