LSTM-Based Coherent Mortality Forecasting for Developing Countries

Abstract: This paper studies a long short-term memory (LSTM)-based coherent mortality forecasting method for developing countries or regions. Many such developing countries have experienced a rapid mortality decline over the past few decades. However, their recent mortality development trend is not necessarily driven by the same factors as their long-term behavior. Hence, we propose a time-varying mortality forecasting model based on the life expectancy and lifespan disparity gap between these developing countries and a selected benchmark group. Here, the mortality improvement trend for developing countries is expected to converge gradually to that of the benchmark group during the projection phase. More specifically, we use a unified deep neural network model with LSTM architecture to project the life expectancy and lifespan disparity difference, which further controls the rotation of the time-varying weight parameters in the model. This approach is applied to three developing countries and three developing regions. The empirical results show that this LSTM-based coherent forecasting method outperforms classical methods, especially for the long-term projections of mortality rates in developing countries.


Introduction
In the last few decades, human mortality has improved significantly in several countries, especially in developing regions. These mortality reductions can generate important longevity risks for life insurance companies and pension schemes. The study of these longevity improvements is fundamental in life insurance and annuity research in the actuarial science literature.
Various statistical techniques exist for forecasting future mortality. A popular method, proposed by Lee and Carter (1992), is the so-called Lee-Carter (LC) model, where the log force of mortality $\ln(m_{x,t})$ is represented as the sum of an age component $a_x$ plus the product of an age-specific function $b_x$ and a time component $k_t$. Such a model cannot be fitted as a regular regression model because of the product of parameter terms. Lee and Carter (1992) applied a two-step method, where singular value decomposition (SVD) is used to fit the model, followed by an autoregressive component, or random walk, to forecast the time component $k_t$.
In the last two decades, the LC method has been used frequently in practice, and several papers propose various extensions of the method; interested readers can refer to Lee (2000), Pitacco (2004), Wong-Fupuy and Haberman (2004), as well as the Cairns-Blake-Dowd (CBD) model (Cairns et al. 2006), and references therein. This literature has shown how the original LC method can lack flexibility with regard to the effect of age. Renshaw and Haberman (2003) extended the LC method to a multi-factor version adding an age-specific enhancement. In addition, a single-factor model with cohort effects was proposed by Renshaw and Haberman (2006). For a comprehensive review of the early literature on these various forecasting methods, refer to Cairns et al. (2008) and Cairns et al. (2011a).
Note that apart from the LC method and its extensions, another approach developed in the literature for mortality modeling is based on generalized linear models (GLMs); for example, see Brouhns et al. (2002), Renshaw and Haberman (2006), and O'Hare and Li (2012). For a comprehensive survey on fitting GLMs to mortality data, refer to Currie (2016). In addition, the Bayesian approach appears in the literature on mortality modeling. For example, Czado et al. (2005) and Pedroza (2006) extend the LC model to Bayesian analyses using Markov chain Monte Carlo (MCMC) methods. Cairns et al. (2011b) further extended Bayesian stochastic mortality modeling to two populations; see Antonio et al. (2015) for an application of a Bayesian method under multiple populations. For more recent studies on mortality modeling with a Bayesian approach, refer to Li and Lu (2018), Li et al. (2019), and Wong et al. (2023), and references therein.
The LC method, and many of its early extensions, focuses on a single population (for example, modeling only one gender at a time or combined genders, and only one country). In particular, let $m_{x,t}$ denote the mortality rate of age $x$ at time $t$ for $t = 0, 1, 2, \ldots, T$ and $x = 0, 1, 2, \ldots, \omega$ for a given population. Then, let $a_x$ denote the mean log mortality rate over time, that is, $a_x = \frac{1}{T+1} \sum_{t=0}^{T} \ln m_{x,t}$. Using the LC method, one obtains
$$\ln m_{x,t} = a_x + b_x k_t + \epsilon_{x,t}, \tag{1}$$
where $\epsilon_{x,t}$ is the mean zero random noise, and $k_t$ is modeled using a random walk with drift $d$:
$$k_t = k_{t-1} + d + \eta_t, \tag{2}$$
where $\eta_t$ is a zero-mean error term.
The literature shows that it is difficult for the LC method to forecast mortality rates for two genders at the same time in one population or in multiple populations and regions, where a certain divergence could be reached due to the differences in $b_x$ and $d$ (and in turn, different $k_t$) in the model. For example, Carter and Lee (1992) suggested using the same $k_t$, but gender-specific $b_x$ values to forecast male and female mortality rates separately in the U.S. Lee and Nault (1993) used the same $k_t$ and $b_x$ for mortality forecasting in each province of Canada, which works only when the $b_x$ values of different provinces, as obtained from historical data, are similar.
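The two-step LC procedure described above (SVD fit, then a random walk with drift for the period index) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `fit_lee_carter` and `forecast_k` are hypothetical, the input is assumed to be a complete matrix of log central death rates with ages in rows and years in columns, and the drift is estimated by the usual average first difference of $k_t$.

```python
import numpy as np

def fit_lee_carter(log_m):
    """Fit ln m[x, t] = a[x] + b[x] * k[t] by SVD.

    log_m: array of shape (n_ages, n_years) of log central death rates.
    Returns (a, b, k) normalized so that sum(b) = 1 and sum(k) = 0.
    """
    a = log_m.mean(axis=1)                  # a_x: mean log rate over time
    centered = log_m - a[:, None]           # each row now sums to zero over t
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    b = U[:, 0]                             # leading left singular vector
    k = s[0] * Vt[0]                        # leading period index
    scale = b.sum()                         # rescale so that sum_x b_x = 1
    return a, b / scale, k * scale          # sum_t k_t = 0 follows from centering

def forecast_k(k, horizon):
    """Central forecast of a random walk with drift fitted to k."""
    d = (k[-1] - k[0]) / (len(k) - 1)       # average first difference (drift)
    steps = np.arange(1, horizon + 1)
    return k[-1] + d * steps, d
```

Projected log rates then follow as `a[:, None] + np.outer(b, k_forecast)`. Only the leading singular triplet is kept, which is what makes the fit a one-factor model.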
However, early discussions on the convergence of life expectancy around the world (see, e.g., White 2002, Wilmoth 1998, and Vaupel and Schnabel 2004) show that life expectancy converges in the long term, so diverging forecasts of mortality rates for different populations in a group of countries are unrealistic. Therefore, Li and Lee (2005) introduced the so-called coherent extension of the LC method for mortality forecasting of a group of populations (we call it the Li-Lee method), where the log mortality rates for each member in the group are decomposed into three parts, namely, member-specific $a_{i,x}$, common age and period effects $B_x$ and $K_t$, and member-specific age and period effects, $b_{i,x}$ and $k_{i,t}$. More precisely, the log mortality rates $\ln(m_{i,x,t})$, for member $i$ in the group at age $x$ and time $t$, can be expressed as
$$\ln m_{i,x,t} = a_{i,x} + B_x K_t + b_{i,x} k_{i,t} + \varepsilon_{i,x,t}, \tag{3}$$
$$K_t = K_{t-1} + d_0 + \nu_t, \qquad k_{i,t} = c_{0,i} + c_{1,i} k_{i,t-1} + \epsilon_{i,t}, \tag{4}$$
where $a_{i,x}$ measures the average mortality level at age $x$ in country $i$. $K_t$ is the common period effect for all countries and is modeled by a random walk with drift $d_0$. $B_x$ is the common age effect, i.e., the common mortality sensitivity at age $x$, with respect to $K_t$. In addition, $k_{i,t}$ and $b_{i,x}$ are the country-specific period and age effects, respectively, which measure the fluctuations around the common mortality patterns in the group for country $i$; here $k_{i,t}$ follows a first-order autoregressive process with coefficients $c_{0,i}$ and $c_{1,i}$. Finally, $\varepsilon_{i,x,t}$, $\nu_t$, and $\epsilon_{i,t}$ are normally distributed i.i.d. errors. It turns out that the additional information provided by similar members/countries in the group can improve the forecast accuracy for individual countries.
On the other hand, according to Hanewald (2011), there is a strong long-term connection between the mortality dynamics and the gross domestic product (GDP) per capita and unemployment rate in a country, which points to the essential difference in mortality improvements between developed countries and developing countries during the same period of time. Based on such an observation, Niu and Melenberg (2014) improved the LC model with an extra factor (namely, GDP) describing the economic growth; see Boonen and Li (2017) for the study under multiple populations. More recently, Ma and Boonen (2023) further argued that the consumer price index (CPI) is a more suitable factor, added to the LC model, to explain the mortality trends in a country, because it captures the affordability of healthcare, food, and housing in that country. The above-mentioned studies verify, from a different point of view, that life expectancy (or any other similar index) can play an important role in mortality forecasting.
Moreover, as mentioned by Li et al. (2013), mortality decline decelerates at younger ages and accelerates at older ages in many developed countries. Such a "rotation" can generate problems in the results of long-term mortality projections using the LC method for developing countries that do not exhibit such a subtle rotation in their historical data, e.g., the projected mortality rates are too low for younger ages. Hence, Li et al. (2013) developed a rotation-based LC method, where the out-of-sample $b_x$ was assumed to converge to an ultimate structure based on the development of life expectancy. But, as argued by Li and Lu (2017), mortality rates should change smoothly and continuously across ages; such a problem, known as age-coherent mortality forecasting, is present in the above-mentioned Li-Lee model. Thus, Gao and Shi (2021) proposed two alternative extensions to the ordinary LC method: the LC-Geometric and LC-Hyperbolic models. The goal was to achieve long-term age coherence in mortality forecasts while retaining the short-term rotation-type forecasting adopted by Li et al. (2013). Here, Geometric and Hyperbolic refer to the type of decay allowed in the autoregressive (AR) model.
Recently, borrowing the concept of "rotation" proposed by Li et al. (2013), Li et al. (2021) developed a so-called rotation algorithm for the coherent mortality projections of less developed countries, which included all regions of Africa, Asia (except Japan), Latin America, the Caribbean, Melanesia, Micronesia, and Polynesia. Under the Li-Lee model, "rotation" refers to changes in the effects of the age and time components in the projection phase for developing countries; rotation may occur in $b_{i,x}$ and $d_i$, based on their own historical data and the corresponding data from a group of developed countries used as the benchmark group. In their model, the rotation algorithm is controlled by a life expectancy gap function, between the target developing country and the benchmark group, where the gap function can be fitted by (double) logistic functions with some selected threshold levels for convergence of the gap.
Therefore, reliable mortality projection methods, especially the long-term projections of age-specific mortality rates, are crucial for developing countries. However, the recent fast decline in aggregate mortality might not be a long-term behavior. According to Müller and Krawinkel (2005), Austin and McKinney (2012), and Jeuland et al. (2013), the main factors contributing to recent mortality improvements in developing countries (especially for infants, the young, and the working-age population) are modernization, improved healthcare coverage, better nutrition, and prevention of infectious diseases, whose effects can only last for a short period of time. In the long run, the mortality patterns of developing countries could more closely resemble those of more developed countries (see, e.g., Li and Lee 2005). Therefore, predicting long-term, age-specific mortality rates in a developing country by simply extrapolating its historical patterns may lead to implausible results.
Hence, a method with the aforementioned "rotation" helps find a balance between the historical mortality pattern of the developing country and the average mortality patterns of a group of developed countries (the benchmark group). Note that for different target countries, the method proposed by Li et al. (2021) needs to use expert judgments and fit different gap functions with possibly different gap thresholds. This can restrict the unified application of the method.
On the other hand, as mentioned by Aburto et al. (2020), populations with the same life expectancy level may experience substantial differences in the time of death. This indicates that the mortality pattern of a developing country can still be different from the benchmark group even if there is a convergence in the life expectancy gap. Hence, the lifespan disparity, which describes life expectancy lost due to death by an individual at different ages and times (see, e.g., Vaupel and Romo (2003) and Zhang and Vaupel (2009)), may provide additional information when examining the convergence of mortality development between the developing country and the benchmark group. Therefore, to continue the study of coherent mortality forecasting (especially long-term projections based on lifespan disparity) for developing countries, here we propose a unified coherent mortality forecasting method with time-dependent rotation weights based on a benchmark group. A deep neural network, in particular a long short-term memory (LSTM), is used for the projection of the life expectancy and lifespan disparity gaps between the target developing country and the corresponding benchmark group. The projected gaps are used in the control of the rotated time-varying weight parameters in the model during the projection phase for mortality forecasting of the developing country.
In the last decade, neural networks, especially deep neural networks (DNNs), have gained attention in the human mortality modeling and forecasting literature. The purpose of building neural network models for human mortality is to extend the modeling and forecasting ability beyond that of classical parametric models, such as the LC method and its many extensions. For example, Hainaut (2018) proposed a type of neural network analyzer, which uses an encoding and decoding network structure in the approximation of the nonlinearity among ages, for each year in the data, for a single country. The model is essentially a feedforward neural network extension of the LC method, where a simple, fully connected feedforward neural network was used to learn the common nonlinearities in the lower dimensional structure of the log-forces of mortality, for different ages across the years. Nigri et al. (2019) extended the classical LC method by introducing an LSTM model for the time series prediction in the forecasting phase of the LC method, where the related time series (i.e., $k_t$) were extracted by following the same SVD method by Lee and Carter (1992). Their results show the prediction power of LSTM, compared to the classical time series prediction method (e.g., ARIMA), especially in capturing the nonlinearities.
More recently, Lindholm and Palmborg (2022) discussed the procedures to efficiently use training data in mortality forecasting when applying an LSTM-based Poisson LC method. Marino et al. (2022) further confirmed that an LSTM model can improve the predictive power of the classical LC method by providing a rigorous analysis of the prediction interval for their so-called LC-LSTM model. Note that Nigri et al. (2021) also used the LSTM model for life expectancy and lifespan disparity forecasting. According to the above-mentioned literature, the LSTM model has shown great prediction power for the forecasting of period effects in the classical LC model, as well as the forecasting of life expectancy and lifespan disparity. It is interesting to further consider the question of whether such a powerful tool (LSTM) can help improve long-term mortality forecasting for developing countries.
On the other hand, deep feedforward neural networks (FNNs), as a different way of extending the LC method, may also be applied to mortality modeling; e.g., see Richman and Wüthrich (2021), where they treat human mortality modeling as a classical supervised learning problem. For the application of convolutional neural networks (CNNs) in mortality modeling, refer to Wang et al. (2021) and Schnürch and Korn (2022). However, their neural networks and methods are fundamentally different from the LSTM-based neural network structure here; therefore, their results are not directly comparable. Furthermore, some non-neural-network machine learning models have appeared in the literature to forecast human mortality rates; for example, see Deprez et al. (2017) and Levantesi and Pizzorusso (2019).
As a result, in this paper, we develop an LSTM-based coherent mortality forecasting method with time-varying rotation structures based on a benchmark group. It provides a unified and model-free mortality projection method for developing countries. The paper is organized as follows: Section 2 introduces some preliminaries, including neural networks, the LSTM, the classical LC, and the Li-Lee model, as well as the definitions of life expectancy and lifespan disparity. The LSTM-based coherent mortality forecasting model is presented in Section 3. Finally, the mortality data and the empirical results are presented in Section 4, followed by the conclusion and some remarks in Section 5.

RNN with LSTM Architecture
Recurrent neural networks (RNNs) are a class of artificial neural networks (ANNs) that can store representations of recent input data through their feedback connections. RNNs have many significant applications in the areas of speech processing, non-Markov control, and time series prediction (see, e.g., Mozer 1991). For instance, let $\{x_1, x_2, \ldots, x_n\}$ denote the time sequence of input vectors, and $\{h_1, h_2, \ldots, h_n\}$ denote the time sequence of output vectors; for the simple RNN, the output vector at time-step $t$ is defined as follows:
$$h_t = \phi\left(W_{hh} h_{t-1} + W_{hx} x_t + b_h\right),$$
where $\phi$ is an activation function, $W_{hh}$ and $W_{hx}$ are the kernel weights for previous time step outputs and current inputs, respectively, and $b_h$ is the corresponding bias.
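The simple RNN recursion can be written out directly. The sketch below is illustrative only: the function name `rnn_forward` is hypothetical, the initial state is assumed to be zero, and `tanh` is used as a default choice for the activation $\phi$.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_hx, b_h, phi=np.tanh):
    """Run a simple RNN over a sequence.

    xs:   array of shape (n_steps, n_inputs), the inputs x_1, ..., x_n
    W_hh: (n_hidden, n_hidden) weights on the previous output h_{t-1}
    W_hx: (n_hidden, n_inputs) weights on the current input x_t
    b_h:  (n_hidden,) bias
    Returns the outputs h_1, ..., h_n stacked as (n_steps, n_hidden).
    """
    h = np.zeros(W_hh.shape[0])             # assumed initial state h_0 = 0
    outputs = []
    for x in xs:
        h = phi(W_hh @ h + W_hx @ x + b_h)  # h_t = phi(W_hh h_{t-1} + W_hx x_t + b_h)
        outputs.append(h)
    return np.stack(outputs)
```

Because the same `W_hh` multiplies the state at every step, gradients propagated back through many steps involve repeated products with this matrix, which is the source of the vanishing and exploding gradient behavior discussed next.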
However, with the conventional gradient-based back-propagation through time (BPTT) algorithm (see Williams and Zipser 1995), simple RNNs suffer from the problem of vanishing or exploding gradients (Pascanu et al. 2013). RNNs with an LSTM architecture, or simply LSTMs, were introduced by Hochreiter and Schmidhuber (1997) in order to overcome such vanishing gradient problems. Instead of using all the memory dynamically when processing the data, the LSTM architecture relies both on the memory block and a few gates for controlling data elaborations.
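LSTMs have shown great power in natural language processing and time series predictions. According to Marino et al. (2022), the LSTM can be expressed in the following mathematical form. Let $N_0$ denote the number of neurons within the input layer, $N_p$ denote the number of neurons of the $p$-th hidden layer with $p \in \{1, \ldots, P\}$, and $N_{P+1}$ denote the number of neurons of the output layer, where $P$, $N_0$, $N_p$ for $p \in \{1, \ldots, P\}$, and $N_{P+1}$ belong to $\mathbb{N}$. Then, the activation of the $p$-th hidden layer may be expressed as an affine mapping, $A^{(p)} : \mathbb{R}^{N_{p-1}} \to \mathbb{R}^{N_p}$, where $\mathbb{R}^{N_{p-1}}$ refers to the output produced by the $(p-1)$-th hidden layer. Writing $z_t^{(p)}$ for the output of the $p$-th hidden layer at time $t$ (with $z_t^{(0)} = x_t$), the output of an LSTM neuron at any time $t$ in the $p$-th hidden layer can be expressed as follows:
$$z_t^{(p)} = o_t^{(p)} \odot \tanh\left(c_t^{(p)}\right), \qquad c_t^{(p)} = f_t^{(p)} \odot c_{t-1}^{(p)} + i_t^{(p)} \odot \tilde{c}_t^{(p)},$$
where $\odot$ denotes the element-wise product. The key to the LSTM lies in the following equations, which describe the outputs of four different gates in the architecture:
$$f_t^{(p)} = \sigma\left(W_f z_t^{(p-1)} + U_f z_{t-1}^{(p)} + b_f\right), \qquad i_t^{(p)} = \sigma\left(W_i z_t^{(p-1)} + U_i z_{t-1}^{(p)} + b_i\right),$$
$$o_t^{(p)} = \sigma\left(W_o z_t^{(p-1)} + U_o z_{t-1}^{(p)} + b_o\right), \qquad \tilde{c}_t^{(p)} = \tanh\left(W_c z_t^{(p-1)} + U_c z_{t-1}^{(p)} + b_c\right),$$
where $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid activation function, $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ is the hyperbolic tangent activation function, $W_k$ and $U_k$ are the weight matrices, and $b_k$, for $k = f, i, o, c$, are the bias terms in the model. Then let $D = \{(x_t, y_t)\}$, $x_t \in \mathbb{R}^{N_0}$, $y_t \in \mathbb{R}^{N_{P+1}}$, be a dataset where $x_t$ and $y_t$ are the input variables and associated responses at time $t$, respectively. Hence, the LSTM is essentially a function, say $g_{\mathrm{LSTM}} : \mathbb{R}^{N_0} \to \mathbb{R}^{N_{P+1}}$, with
$$y_t = g_{\mathrm{LSTM}}(x_t; \mathcal{W}) + \gamma_t, \tag{5}$$
where the output layer applies the activation function $\psi : \mathbb{R}^{N_P} \to \mathbb{R}^{N_{P+1}}$ to the output of the last hidden layer, $\mathcal{W}$ is the set of all weight parameters in the network, and $\gamma_t$ is a noise term, with zero mean and variance $\sigma_t^2$, independent of $g_{\mathrm{LSTM}}$.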
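To make the gate mechanics concrete, the following sketch performs one time step of a single LSTM layer using the standard textbook gate equations. The names `lstm_step`, `W`, `U`, and `b` are illustrative assumptions (dictionaries keyed by gate `f`, `i`, `o`, `c`), and the notation is not taken from Marino et al. (2022).

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step for a single layer.

    x:      current input, shape (n_inputs,)
    h_prev: previous output, shape (n_hidden,)
    c_prev: previous cell state, shape (n_hidden,)
    W, U, b: dicts keyed by 'f', 'i', 'o', 'c' holding input weights
             (n_hidden, n_inputs), recurrent weights (n_hidden, n_hidden),
             and biases (n_hidden,).
    """
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])        # forget gate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])        # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"])  # candidate state
    c = f * c_prev + i * c_tilde                              # new cell state
    h = o * np.tanh(c)                                        # new output
    return h, c
```

The additive update of the cell state `c` is the design choice that lets information (and gradients) persist across many steps, in contrast to the repeated matrix products of the simple RNN.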

The Mortality Models
Here, we use two classical mortality projection methods, namely the LC model and the Li-Lee model. Specifically, we use the LC method for a rough/first-step mortality forecast of the target developing countries. That is, for such single populations, the Lee-Carter method (Lee and Carter 1992) assumes that the logarithm of the crude death rates ($m_{x,t}$) for each age $x$ and year $t$ satisfies (1) and (2), where $a_x$ summarizes the average level of mortality over time at age $x$, $k_t$ provides the overall level of mortality at year $t$, and $b_x$ measures the age effect of mortality in different periods. Note that the following is also assumed in the LC method (for identification purposes):
$$\sum_x b_x = 1 \quad \text{and} \quad \sum_t k_t = 0.$$
On the other hand, the Li-Lee method introduced by Li and Lee (2005) is an extension of the classical LC method, which can generate coherent mortality projections for multiple countries. In this study, the Li-Lee method (given by (3) and (4)) is used as the first step in mortality forecasting of the benchmark countries (i.e., a selected group of developed countries). Note that $K_t$, in general, can be fitted by a random walk with drift (i.e., nonstationary), whereas $k_{i,t}$ for all $i$ is assumed to be stationary; that is, $k_{i,t}$ shall be fitted by a random walk without drift or first-order autoregressive model AR(1) with a coefficient that yields a bounded short-term trend; for more details, refer to Li and Lee (2005).
Under the Li-Lee method, the long-term mortality trend is uniquely determined by the common period effect $K_t$, which makes the mortality forecasts coherent for all member countries in the group. In the empirical application, if the $k_{i,t}$ of a country is non-stationary, then this country is considered non-coherent with other countries in the group, i.e., there is significant divergence between the historical mortality experience and the common mortality patterns $B_x$ and $K_t$. Hence, we may need to exclude such countries from the selected benchmark group.
Note that in order to ensure comparability between the parameters in the LC and Li-Lee methods in our subsequent analysis, one should impose the same normalization constraints on the key parameters of the two methods (see, e.g., Li et al. 2018).

Life Expectancy and Lifespan Disparity
Most mortality forecasting methods aim to predict how many additional years of life people will gain in the future. Life expectancy at birth, which measures the central tendency, is frequently applied to evaluate the precision of mortality forecasting methods. However, as mentioned by Aburto et al. (2020), populations with the same life expectancy level might experience substantial differences in the time of death, i.e., life expectancy cannot detect distributional variations in lifespan. Therefore, lifespan disparity can serve as an additional indicator to evaluate mortality forecasting methods (see, e.g., Bohk-Ewald et al. 2017).
Let us introduce the notation and definitions of life expectancy and lifespan disparity. Let $S(x,t)$ and $\mu(x,t)$ denote the survival function and the force of mortality for an individual aged $x$ at time $t$, respectively, for a given population. These are assumed to be two continuous functions with respect to $x$ and $t$. Also, denote by $e_{x,t}$ the life expectancy for age $x$ at time $t$ as
$$e_{x,t} = \frac{1}{S(x,t)} \int_x^{\infty} S(a,t)\, da, \tag{6}$$
where
$$S(x,t) = \exp\left(-\int_0^{x} \mu(a,t)\, da\right),$$
and $\mu(a,t)$ is the corresponding force of mortality at age $a$ and time $t$.
To measure lifespan disparity, we take the average number of life years lost at birth (see Vaupel and Romo 2003; Zhang and Vaupel 2009):
$$e^{\dagger}_{0,t} = \int_0^{\infty} d(y,t)\, e_{y,t}\, dy, \tag{7}$$
where $d(y,t)$ are the deaths at age $y$ and time $t$, and $e_{y,t}$ is the remaining life expectancy at age $y$ and time $t$. Equation (7) shows that lifespan disparity is an indicator representing the life expectancy lost due to death, weighted by the distribution of deaths across ages, at time $t$. Note that lifespan disparity can be described by other measures, such as the standard deviation, interquartile range, Gini coefficient, or prolate index. However, we shall use $e^{\dagger}_{0,t}$ for lifespan disparity in our analysis. Demographically, apart from its interpretation as the average life years lost or lost living potential, it also provides information about the capacity for further increases in life expectancy (see Bohk-Ewald et al. 2017).
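In practice both quantities are computed from single-year death rates rather than continuous functions. The sketch below shows one possible discretization, under assumptions that are not taken from the paper: the force of mortality is treated as constant within each one-year age interval, the last age is an open interval with a constant hazard, and deaths within a closed interval lose the average of the remaining life expectancies at its two endpoints. The function name `life_table_summary` is hypothetical.

```python
import numpy as np

def life_table_summary(m):
    """Life expectancy at birth and lifespan disparity from death rates.

    m: positive central death rates for ages 0, 1, ..., omega (single years);
       the last entry is treated as an open age interval.
    Returns (e0, e_dagger).
    """
    m = np.asarray(m, dtype=float)
    # Survival to exact age x under a piecewise-constant hazard
    S = np.exp(-np.concatenate(([0.0], np.cumsum(m[:-1]))))
    # Life-table deaths per interval; everyone alive at omega dies in the open interval
    d = np.append(S[:-1] - S[1:], S[-1])
    # Person-years lived: (S_x - S_{x+1}) / m_x, and S_omega / m_omega for the open interval
    L = d / m
    # Remaining life expectancy at each exact age
    e = np.cumsum(L[::-1])[::-1] / S
    # Life expectancy lost by those dying in each interval (assumed midpoint average)
    e_lost = np.append(0.5 * (e[:-1] + e[1:]), e[-1])
    return e[0], float(np.sum(d * e_lost))
```

As a sanity check on the discretization, a constant death rate of 0.02 at all ages corresponds to an exponential lifetime, for which both life expectancy and lifespan disparity equal 50 years.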

LSTM-Based Coherent Method
In this section, we introduce our LSTM-based coherent mortality forecasting method with an embedded rotation in the time-varying model weight parameters during the projection phase for developing countries. The key to our method is a time-varying LC model, see (8) below, with an LSTM-based component that controls the rotation of the weight parameters in the model. Here, the rotation refers to the gradual change in the mortality development pattern during the projection phase (see Li et al. 2021). The time-varying parameters are defined through time-dependent weighted averages of the corresponding projected parameters based on the historical data from the target developing country and the selected benchmark group, respectively, where the weights are rotated based on how close the target developing country is to the benchmark group, under given criteria.
More specifically, we consider the following projection method for the logarithm of the central death rate $m^j_{x,t}$ for a particular developing country, say $j$, at age $x$ and year $t$, such that
$$\ln m^j_{x,t} = a^j_x + b^j_{x,t} k^j_t + \varepsilon_{x,t}, \qquad k^j_t = k^j_{t-1} + d^j_t + \epsilon_t, \tag{8}$$
where $a^j_x$ can be estimated by the average mortality level at age $x$ for developing country $j$. Here, $k^j_t$ is the period effect and $\varepsilon_{x,t}$ and $\epsilon_t$ are two zero mean random noises. The main difference between our method, in (8), and the classical LC method is the time-varying $b^j_{x,t}$ that measures a time-dependent age effect on mortality at different periods, and $d^j_t$, which describes a time-dependent drift in the random walk model used to project $k^j_t$.
Then, we select a group of developed countries as the benchmark group, where the classical Li-Lee method is applied in the projection of the logarithm of the central death rates $m_{i,x,t}$, for member $i$ in the group, that is,
$$\ln m_{i,x,t} = a_{i,x} + B_x K_t + b_{i,x} k_{i,t} + \varepsilon_{i,x,t}, \qquad K_t = K_{t-1} + d_0 + \nu_t.$$
$B_x$ measures the common age effect in the benchmark group, and $d_0$ gives the drift in the random walk model of the common period effect $K_t$. $B_x$ and $d_0$ are the two key components to be extracted from the benchmark group, using the Li-Lee model, to then be used in the rotation during the projection phase.
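The extraction of the common age effect and common drift from the benchmark group can be sketched as follows. This is a simplified illustration, not the estimation procedure of Li and Lee (2005): the common factor is assumed here to be fitted by SVD on the unweighted average of the centered log rates across members, member-specific factors are taken as the leading SVD term of each member's residuals (left at unit norm, sign not identified), and the names `leading_factor` and `fit_li_lee` are hypothetical.

```python
import numpy as np

def leading_factor(mat):
    """Rank-one SVD approximation: mat ~ outer(age_effect, period_effect)."""
    U, s, Vt = np.linalg.svd(mat, full_matrices=False)
    return U[:, 0], s[0] * Vt[0]

def fit_li_lee(log_m):
    """Simplified Li-Lee fit.

    log_m: array of shape (n_members, n_ages, n_years) of log death rates.
    Returns (a, B, K, d0, b, k).
    """
    n_members, n_ages, n_years = log_m.shape
    a = log_m.mean(axis=2)                       # a_{i,x}: member-specific level
    centered = log_m - a[:, :, None]
    # Common factor from the unweighted group average (assumption of this sketch)
    B, K = leading_factor(centered.mean(axis=0))
    scale = B.sum()
    B, K = B / scale, K * scale                  # normalize so that sum_x B_x = 1
    d0 = (K[-1] - K[0]) / (n_years - 1)          # drift of the common random walk
    # Member-specific factors from what the common factor leaves unexplained
    resid = centered - B[None, :, None] * K[None, None, :]
    b = np.empty((n_members, n_ages))
    k = np.empty((n_members, n_years))
    for i in range(n_members):
        b[i], k[i] = leading_factor(resid[i])
    return a, B, K, d0, b, k
```

Only `B` and `d0` feed into the rotation step; the member-specific series `k` are mainly useful for checking that each benchmark member is coherent with the group, for example by fitting an AR(1) model to each row.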
Here is how the rotation works in our method. As explained in Section 2, we first select the life expectancy and lifespan disparity, respectively, as the criteria to describe the gap in terms of mortality levels, between the target developing country and the benchmark group. The varying gap will control the rotation of age and period effects in the projection phase for developing countries.
However, instead of selecting and fitting various (double) logistic functions as well as tailored threshold levels for the gap (see Li et al. 2021), we propose using a unified LSTM model for the gap forecasting. In addition, we use life expectancy, given by (6), and lifespan disparity, given by (7), respectively, in the construction of the gap function that describes the mortality distance between the target developing country and the benchmark group.
In particular, for notational simplicity, we let $e^i_u$ and $e^{\dagger i}_u$ denote, respectively, the (projected) life expectancy and lifespan disparity at birth for the $i$-th member in the benchmark group, in year $u$, and define the corresponding average life expectancy and lifespan disparity at birth, in year $u$, for the whole benchmark group as follows:
$$e^b_u = \frac{1}{n} \sum_{i=1}^{n} e^i_u, \qquad e^{\dagger b}_u = \frac{1}{n} \sum_{i=1}^{n} e^{\dagger i}_u,$$
where $n$ is the number of members in the benchmark group. Let $e^j_u$ and $e^{\dagger j}_u$ denote the corresponding life expectancy and lifespan disparity at birth for the target developing country/region $j$, in year $u$, for $u = \ldots, T, \ldots$. A unified LSTM model is introduced to forecast the life expectancy and lifespan disparity for both the target developing country and the benchmark group, such that the projected gaps in mortality levels between the developing country and the benchmark group can be expressed as the forecast for life expectancy or lifespan disparity difference.
More specifically, we construct LSTM models for the projection of $e^{\bullet}_t$ and $e^{\dagger \bullet}_t$ for both target developing countries/regions (i.e., $e^j_t$ and $e^{\dagger j}_t$) and the benchmark group (i.e., $e^b_t$ and $e^{\dagger b}_t$) as follows:
$$e^{\bullet}_t = g^{e}_{\mathrm{LSTM}}\left(e^{\bullet}_{t-1}; \mathcal{W}_e\right) + \epsilon^{e}_t, \qquad e^{\dagger \bullet}_t = g^{e^{\dagger}}_{\mathrm{LSTM}}\left(e^{\dagger \bullet}_{t-1}; \mathcal{W}_{e^{\dagger}}\right) + \epsilon^{e^{\dagger}}_t, \tag{9}$$
where $\epsilon^{e}_t$ and $\epsilon^{e^{\dagger}}_t$ are zero mean errors. $g^{e}_{\mathrm{LSTM}}$ and $g^{e^{\dagger}}_{\mathrm{LSTM}}$ are given by (5), respectively, for life expectancy and lifespan disparity, and $\mathcal{W}_e$ and $\mathcal{W}_{e^{\dagger}}$ are the weight parameters in the corresponding LSTM models (see, for example, Nigri et al. 2021). The parameters in the LSTM model (i.e., the functional form of $g^{e}_{\mathrm{LSTM}}$ and $g^{e^{\dagger}}_{\mathrm{LSTM}}$) are optimized using an $L_2$ loss function, namely
$$\min_{\mathcal{W}_e} \sum_t \left(e^{\bullet}_t - g^{e}_{\mathrm{LSTM}}\left(e^{\bullet}_{t-1}; \mathcal{W}_e\right)\right)^2,$$
and analogously for $e^{\dagger \bullet}_t$. Note that, to show the long-term projection power of our method, we need a sufficient number of years of mortality rates in the out-of-sample data, which results in a limited size of the in-sample mortality data for the training of the neural network model. Therefore, in this paper, we only consider a first-order autoregressive approach in the LSTM model, as illustrated in (9), where the neural network learns at each time step the relationship between two consecutive values during the training period (i.e., one-to-one structure). In addition, the method can be extended to more complex LSTM models with the structure of many-to-one or many-to-many, given that the available data are sufficiently large.
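The one-to-one training structure and the recursive multi-step projection can be illustrated independently of any deep learning library. In the sketch below, all function names are hypothetical, and `fit_linear_one_step` is a deliberately simple least-squares stand-in for the trained LSTM (an assumption made only to keep the example self-contained); any callable mapping one value to the next could be substituted.

```python
import numpy as np

def make_one_step_pairs(series):
    """Split a series into (input, target) pairs of consecutive values."""
    s = np.asarray(series, dtype=float)
    return s[:-1], s[1:]

def l2_loss(predictions, targets):
    """Sum of squared one-step-ahead errors."""
    return float(np.sum((np.asarray(targets) - np.asarray(predictions)) ** 2))

def fit_linear_one_step(series):
    """Stand-in for the trained network: least-squares line x_t -> x_{t+1}."""
    x, y = make_one_step_pairs(series)
    slope, intercept = np.polyfit(x, y, 1)
    return lambda value: slope * value + intercept

def recursive_forecast(one_step_model, last_value, horizon):
    """Feed each prediction back in as the next input."""
    path, value = [], float(last_value)
    for _ in range(horizon):
        value = float(one_step_model(value))
        path.append(value)
    return np.array(path)
```

Because each forecast is fed back as the next input, one-step errors compound over the horizon, which is one reason the in-sample fit alone is not a sufficient check for long-term projections.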
Next, we illustrate in detail our LSTM-based coherent mortality forecasting method. Consider the method based on lifespan disparity; for the case with life expectancy, one simply replaces all $e^{\dagger \bullet}_t$ by $e^{\bullet}_t$ in the corresponding equations. As noted, at the core of the method is the time-varying LC model given in (8), where the term for the age effect, $b^j_{x,t}$, and the drift term, $d^j_t$, of the period effect, depend on time $t$. The time dependence is described through a set of time-varying weights, in terms of lifespan disparity gaps, linking the mortality improvements between the target developing country and the benchmark group. Let $\hat{b}^j_x$ and $\hat{d}^j$ denote the estimated age effect term and the drift parameter of the period effect term for the target developing country $j$, based on the classical LC method. Now, let $\hat{B}_x$ denote the estimated common age effect term and $\hat{d}_0$ denote the drift parameter of the common period effect term, obtained by using Li-Lee's method for the benchmark group. These two parameters provide information on the common mortality improvements of the benchmark group. Then, the next step is to specify how the time-varying $b^j_{x,t}$ and $d^j_t$ in (8) are defined in the mortality projection phase. To be specific, at the beginning of the projection phase, one can simply rely on the historical mortality data of the developing country when forecasting the short-term mortality rates.
In addition, denote the lifespan disparity at birth at time $t$ projected through the LSTM model given in (9) as $e^{\dagger j}_t$ and $e^{\dagger b}_t$ for the developing country/region $j$ and the benchmark group, respectively. Then, define the lifespan disparity gap at time $t$ between the target country/region and the benchmark group as follows:
$$g^{\dagger}_t = e^{\dagger j}_t - e^{\dagger b}_t. \tag{10}$$
Then, for intermediate or long-term projections, include data from the benchmark group so that the long-term mortality development of the developing country converges gradually to the common trend in this benchmark group. Hence, redefine the age effect and drift terms of the period effect in the LC model as
$$b^j_{x,t+1} = (1 - \omega_t)\, \hat{b}^j_x + \omega_t\, \hat{B}_x, \qquad d^j_{t+1} = (1 - \omega_t)\, \hat{d}^j + \omega_t\, \hat{d}_0, \tag{11}$$
where the weight $\omega_t$ is defined by (12) in terms of the gaps $g^{\dagger}_T$ and $g^{\dagger}_t$ and a tuning parameter $p$, for each age $x$ and $t = T, T + 1, \ldots$, and $g^{\dagger}_{\bullet}$ is given by (10). $\omega_t$ denotes the time-varying weights that link the projected time-dependent age and period effect parameters, in year $t + 1$, to the weighted average of the estimated $\hat{b}^j_x$ and $\hat{d}^j$, respectively, with $\hat{B}_x$ and $\hat{d}_0$ in the first step (see, e.g., Li et al. 2013). To simplify the analysis, we apply here the same weight parameter for both $b^j$ and $d^j$ in (11). In addition, $p \in [0, 1]$ in (12) is a tuning parameter that controls the functional form of $\omega_t$. We set $p = 1$ in our analysis such that $\omega_t$ has a considerably low rate of change when its value is close to zero or one. Note that when $t = T$, we have $\omega_t = 0$, which means that at the beginning of the projection phase, the method relies only on the historical data from the developing country. For $t > T$, as the lifespan disparity gap decreases, $\omega_t$ increases smoothly to one if the lifespan disparity gap diminishes in the future projection phase. Note that if the projected lifespan disparity gap for a particular developing country diverges (e.g., $g^{\dagger}_t > g^{\dagger}_T$ for $t > T$), we simply forecast the mortality based on its own historical data (that is, $\omega_t = 0$ for all $t > T$).
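The rotation step can be sketched as follows. The blending of the country's own parameters with the benchmark parameters follows the weighted-average description in the text. The functional form of the weight, however, is an assumption of this sketch: a sine-smoothed function of the relative gap reduction, chosen only because it starts at zero, rises smoothly to one, and changes slowly near both ends when $p = 1$. It should not be read as the paper's definition of $\omega_t$. All function names are hypothetical.

```python
import numpy as np

def rotation_weights(gap, p=1.0):
    """Weights omega_T, omega_{T+1}, ... from projected gaps g_T, g_{T+1}, ...

    ASSUMED form (not confirmed by the source): with w_t = (g_T - g_t) / g_T,
    omega_t = (0.5 * (1 + sin(pi/2 * (2 * w_t - 1)))) ** p.
    gap[0] is the gap at the jump-off year T and must be positive.
    """
    gap = np.asarray(gap, dtype=float)
    if np.any(gap[1:] > gap[0]):
        # Diverging gap: rely on the country's own history (omega = 0 throughout)
        return np.zeros_like(gap)
    w = np.clip((gap[0] - gap) / gap[0], 0.0, 1.0)
    return (0.5 * (1.0 + np.sin(0.5 * np.pi * (2.0 * w - 1.0)))) ** p

def rotate_parameters(omega, b_own, d_own, B_bench, d_bench):
    """Time-varying age effects and drifts as weighted averages.

    Row h holds the parameters for year T + h + 1, built from omega_{T+h}.
    """
    omega = np.asarray(omega, dtype=float)
    b_t = (1.0 - omega)[:, None] * b_own[None, :] + omega[:, None] * B_bench[None, :]
    d_t = (1.0 - omega) * d_own + omega * d_bench
    return b_t, d_t

def project_log_rates(a, k_T, b_t, d_t):
    """Central projection: k accumulates the time-varying drifts from k_T."""
    k = k_T + np.cumsum(d_t)
    return a[None, :] + b_t * k[:, None], k
```

A typical call chain would be `omega = rotation_weights(gap)`, then `b_t, d_t = rotate_parameters(omega, b_hat, d_hat, B_hat, d0_hat)`, then `project_log_rates(a_hat, k_last, b_t, d_t)`, where the inputs come from the LC fit for the target country and the Li-Lee fit for the benchmark group.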
Finally, we summarize the model structure schematically in Figure 1 below. To be specific, the LSTM-based coherent mortality forecasting model contains two parts: (1) on the left of Figure 1 is a neural network component that contains an input layer, two layers of the LSTM structure (both LSTM layers contain 128 neurons with accompanying dropout layers), and two (fully connected) dense layers that contain 64 and 32 neurons, respectively, with accompanying dropout layers for the output. Note that adding dense layers in the model can provide flexibility in the control of the non-linearity of the model. (2) On the right of Figure 1, the projected life expectancy or lifespan disparity is passed to a component of the rotation algorithm for calculating the time-varying weights, and then the projected weights are applied to the time-varying LC model for the forecasting of mortality rates.

Empirical Analysis
This section presents the application of our LSTM-based coherent mortality forecasting method to three developing countries, namely China, Brazil, and Nigeria, which are the most populous countries in their respective continents and also belong to the emerging/emerged markets in the world. According to BBVA (2014), China and Brazil are classified as EAGLEs, i.e., emerging and growth-leading economies that are expected to have GDP increments larger than the average of the G7 economies, excluding the US, in the next ten years. Nigeria is classified as NEST, i.e., an emerging country that is expected to have GDP increments lower than the average of the G7 economies (excluding the US) but higher than Italy's in the next ten years. In addition, we apply our method to three developing regions, namely the less developed regions (LDR), the less developed regions excluding China (LDRexChina), and the less developed regions excluding the least developed countries (LDRexLDC). The United Nations defines the less developed countries/regions as all regions of Africa, Asia (except Japan), Latin America, and the Caribbean, plus Melanesia, Micronesia, and Polynesia, and categorizes 45 countries as the least developed countries (UN Source: https://unctad.org/topic/least-developed-countries/list (accessed on 21 November 2023)), including 33 countries in Africa, 8 countries in Asia, 1 in the Caribbean, and 3 in the Pacific. Before presenting the empirical results, we first introduce the mortality data used in the analysis.

Mortality Data
In this study, the benchmark group is made up of nine selected developed countries, namely Denmark, Finland, France, the Netherlands, Switzerland, Sweden, the UK, the US, and Japan. The mortality rates of these countries are obtained from the Human Mortality Database. In particular, we use the central death rates by single year of age and single calendar year, i.e., ages 0, 1, 2, ..., 99, and years ranging from 1950 to 2019.
The mortality data for the six target developing countries/regions mentioned above are not included in the Human Mortality Database. Hence, the corresponding data are obtained from the Population Division of the United Nations (UN Source: https://population.un.org/wpp/Download/Standard/Mortality/ (accessed on 21 November 2023)). Note that a necessary condition for the application of our method is that the life expectancy or lifespan disparity of the target countries/regions converges to that of the benchmark group. Hence, a preliminary study is needed to select developed countries that can form a proper benchmark group. Figure 2 illustrates the convergence of life expectancy and lifespan disparity at birth between China and the benchmark group. By contrast, the life expectancy and lifespan disparity at birth in Ukraine do not converge to those of the benchmark group. In Figure 2, one can also recognize a spike around the year 1960 in both the life expectancy and the lifespan disparity in China. These mortality outliers are due to the Great Chinese Famine of 1959 to 1961. Hence, in order to reduce the effects of such extreme outliers, in the following analysis we use only the data from 1962 to 2019 whenever the data of China are involved (i.e., the cases of China, LDR, and LDRexLDC).

LSTM for Life Expectancy and Lifespan Disparity
As discussed above, in order to construct time-varying weights that depend on the convergence of the life expectancy and lifespan disparity of a developing country to those of the benchmark group, one needs to develop projection models for the corresponding life expectancy or lifespan disparity gap.
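For readers less familiar with the two summary measures, the sketch below computes life expectancy at birth and lifespan disparity at birth (average life years lost due to death) from single-age central death rates using a standard period life table. The construction choices are assumptions made for illustration and are not taken from the paper: deaths occur mid-year on average, the last age is treated as an open interval, and the years lost by those dying in an age interval are approximated by the average of the remaining life expectancies at its two ends. The function name `life_table_summary` is hypothetical.

```python
def life_table_summary(mx):
    """Return (e0, e_dagger) from single-age central death rates m_0, ..., m_last.

    ASSUMPTIONS (textbook choices, not taken from the paper): deaths occur
    mid-year on average, and the last age is an open-ended interval.
    """
    n = len(mx)
    # Death probabilities; everyone alive at the last age dies in the open interval.
    qx = [m / (1.0 + 0.5 * m) for m in mx[:-1]] + [1.0]
    # Survivors l_x with radix l_0 = 1.
    lx = [1.0]
    for q in qx[:-1]:
        lx.append(lx[-1] * (1.0 - q))
    dx = [l * q for l, q in zip(lx, qx)]
    # Person-years lived: mid-year deaths below the last age, l/m in the open interval.
    Lx = [lx[x + 1] + 0.5 * dx[x] for x in range(n - 1)] + [lx[-1] / mx[-1]]
    # Remaining life expectancy e_x = T_x / l_x, accumulating T_x from the top age down.
    ex = [0.0] * n
    person_years_above = 0.0
    for x in range(n - 1, -1, -1):
        person_years_above += Lx[x]
        ex[x] = person_years_above / lx[x]
    # Life years lost: deaths in [x, x+1) lose roughly the average of e_x and e_{x+1}.
    e_dagger = sum(dx[x] * 0.5 * (ex[x] + ex[x + 1]) for x in range(n - 1))
    e_dagger += dx[-1] * ex[-1]
    return ex[0], e_dagger

# Toy input: a flat death rate of 0.1 at every age from 0 to 99.
e0, e_dag = life_table_summary([0.1] * 100)
```

A flat death rate is a convenient sanity check, because in that case both measures equal the reciprocal of the rate. Applying the function year by year to a target population and to the benchmark group gives the two series whose difference forms the gap to be projected.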
Note that in the literature (see Li et al. 2021), the forecasting of the life expectancy gap based on statistical methods uses different functions (logistic or double logistic) for different developing countries/regions. Also, exogenous thresholds need to be introduced to test convergence in the model. The situation is even more complex if the lifespan disparity gap is also introduced in the method.
Here, instead of fitting various functions with thresholds to the life expectancy and lifespan disparity gaps, we use a unified LSTM model (see, e.g., Nigri et al. 2021) for the forecasting and identification of the gaps, compared with the benchmark group, for both the life expectancy and lifespan disparity of all six countries/regions. The projected life expectancy and lifespan disparity using the LSTM model for the six target countries/regions are presented in Appendix A.1. For similar results regarding the projection of life expectancy and lifespan disparity of other countries selected from the Human Mortality Database, refer to Nigri et al. (2021).

Empirical Results
In this empirical study, we carry out an out-of-sample test when training the model with the above-mentioned mortality dataset. For the purpose of long-term predictions, the data are divided into two parts: the first 35 years of data, from 1950 to 1984 (from 1962 to 1984 for China, LDR, and LDRexLDC), are used for training, and the rest of the data, from 1985 to 2019, are set aside as the test data. To avoid possible overfitting to the training dataset, 20% of the training data are selected randomly as a validation set at each epoch.
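The data split described above can be sketched as follows. The helper names `split_years` and `draw_validation` are hypothetical, and redrawing the validation subset with a fresh random shuffle is only one possible reading of "selected randomly at each epoch".

```python
import random

def split_years(first_year, last_train_year, last_year):
    """Split a range of calendar years into training and test years."""
    train_years = list(range(first_year, last_train_year + 1))
    test_years = list(range(last_train_year + 1, last_year + 1))
    return train_years, test_years

def draw_validation(samples, fraction=0.2, rng=None):
    """Randomly hold out a fraction of the training samples for validation."""
    rng = rng or random.Random()
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = round(fraction * len(shuffled))
    return shuffled[n_val:], shuffled[:n_val]

# Training 1950-1984 and testing 1985-2019 (1962-1984 when China's data are involved).
train_years, test_years = split_years(1950, 1984, 2019)
china_train_years, _ = split_years(1962, 1984, 2019)

# One validation draw; a training loop would repeat this at the start of every epoch.
fit_years, val_years = draw_validation(train_years, rng=random.Random(0))
```

Note that the shorter window for China, LDR, and LDRexLDC leaves 23 training years against the same 35 test years, which makes the long-term test more demanding for those cases.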
To assess the projection accuracy of our models, we use the mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE) of the projected log-mortality rates in the test data. The forecasting results for the LSTM-based, time-varying LC method, which includes life expectancy and lifespan disparity, respectively, are compared with the traditional LC and Li-Lee methods. The LSTM model was implemented using Keras with TensorFlow in Python, while the R package "StMoMo" was used for the LC and Li-Lee methods and for the initial mortality data processing. Note that in the following tables, we use LSTM-ex to denote our forecasting model based on life expectancy, and LSTM-disp to denote our model based on lifespan disparity.
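As a concrete illustration of the three criteria, the following minimal sketch evaluates them on log-mortality rates. The function name `log_mortality_errors` and the toy numbers are hypothetical; in practice the inputs would be the observed and projected death rates flattened over all ages and test years.

```python
import math

def log_mortality_errors(actual_mx, predicted_mx):
    """Return (MSE, RMSE, MAE) of predicted versus actual log death rates."""
    diffs = [math.log(p) - math.log(a) for a, p in zip(actual_mx, predicted_mx)]
    n = len(diffs)
    mse = sum(d * d for d in diffs) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(d) for d in diffs) / n
    return mse, rmse, mae

# Toy example with made-up rates for four age-year cells.
actual = [0.0100, 0.0020, 0.0500, 0.1500]
predicted = [0.0110, 0.0018, 0.0520, 0.1400]
mse, rmse, mae = log_mortality_errors(actual, predicted)
```

Working on the log scale means that a given proportional error counts equally at young ages, where rates are tiny, and at old ages, where they are large.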

Results for China, Brazil, and Nigeria
The first empirical results are for the application of our model to the mortality data of China, Brazil, and Nigeria. These selected target developing countries represent demographic trends in their continents. This strategy removes the effect of different ethnic groups on life expectancy or lifespan disparity and demonstrates the generality of the model. The six-year average projection errors are listed in Tables 1-3; more detailed results, in terms of projection errors for each year for males and females, respectively, are presented in Appendix A.2. Tables 1-3 (see also Figures A5-A7) show a clear accumulation of errors in long-term forecasts, which reveals the difficulty of long-term mortality forecasting, especially for the classical LC method. In our method, especially the version based on lifespan disparity, this error accumulation is reduced to some extent, making long-term forecasting more reliable.
The results clearly show that the classical LC method underperforms, as it is based only on the historical mortality data of the target country or region. If the current mortality development trend in a developing country is not sustainable, mortality rates will gradually approach those of developed countries (like the benchmark group selected here). Hence, in such a setting, projections based solely on national mortality data are not reasonable. From the above results, the LSTM-based, time-varying LC method with lifespan disparity controlling the rotation in the time-dependent weights is the most accurate of the four methods examined here, especially for long-term projections. It is also interesting to observe that Nigeria has the most significant reduction in projection error when moving from the classical LC method to our LSTM-based, time-varying LC method.

Results for LDR, LDRexChina, LDRexLDC
Finally, in order to demonstrate the projection accuracy of our method, the following illustration is for the mortality data of three developing regions, denoted as the less developed regions (LDR), the less developed regions excluding China (LDRexChina), and the less developed regions excluding the least developed countries (LDRexLDC) (see Tables 4-6 and also Figures A8-A10 in Appendix A.3).
Note that the error fluctuations in the prediction results for most of these less developed regions are reduced significantly (see Figures A8-A10), which is reasonable since the less developed regions contain larger populations, and hence have more stable mortality data, than individual countries. Overall, the LSTM-based, time-varying LC method with lifespan disparity controlling the rotation in the time-varying weights provides the most accurate projections among the four methods examined here.
It is worth mentioning that both our LSTM-based, time-varying LC method and the Li-Lee method incorporate mortality trend corrections based on a benchmark group. However, the empirical results show that, for developing countries or regions, such corrections are better modeled through the projection of the life expectancy or lifespan disparity difference with an LSTM model, especially for long-term forecasts. To end this section, we draw heatmaps that show the relative prediction errors (i.e., (predicted value − actual value)/actual value) across all ages and years for the out-of-sample data. The results are presented in Figures 3 and 4. For most ages and years in the six target developing countries/regions (except young males in China and females in Brazil), our model performs well. However, we also observe some cohort effects in the results, especially for the data from China. A possible improvement could be to extend (11) and (12) to include age dependency in our model. This is a non-trivial extension that is left for future studies.
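The quantity plotted in the heatmaps can be computed cell by cell, as in the minimal sketch below. The function name `relative_errors`, the age-by-year layout of the nested lists, and the toy numbers are assumptions made for illustration.

```python
def relative_errors(actual, predicted):
    """Relative prediction error (predicted - actual) / actual for each age-year cell."""
    return [
        [(p - a) / a for a, p in zip(actual_row, predicted_row)]
        for actual_row, predicted_row in zip(actual, predicted)
    ]

# Toy grid: rows are ages, columns are test years (made-up death rates).
actual_grid = [[0.010, 0.020], [0.040, 0.050]]
predicted_grid = [[0.011, 0.018], [0.040, 0.060]]
error_grid = relative_errors(actual_grid, predicted_grid)
```

Positive cells indicate over-predicted mortality and negative cells under-predicted mortality, so a diagonal band of same-signed cells across ages and years is the kind of pattern that would suggest a cohort effect.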

Conclusions
Mortality improvements are linked to social progress, for instance, in terms of health, nutrition, education, hygiene, and access to medical assistance. It is difficult to accurately predict mortality development trends, especially over a long-term period. For developing countries or regions, it is particularly important to provide accurate long-term predictions of mortality rates for each age in the population, given that the current mortality data might not reveal development trends that are sustainable in the long run.
The proposal here is an LSTM-based coherent mortality forecasting method for developing countries, where the life expectancy and lifespan disparity gaps between the target developing country and the selected benchmark group are used for long-term projections.
In particular, we allow the mortality development pattern of a developing country to be a weighted average of the trends generated by its own historical data and by the selected benchmark group. The rotation in the time-varying weights is controlled by the projected life expectancy and lifespan disparity gaps between the developing country and the benchmark group. In addition, we introduce a unified deep neural network model with an LSTM architecture for the long-term forecasting of the gaps in life expectancy and lifespan disparity for all six developing countries and regions in our analysis.

[...]$_k$ for $k = f, i, o, c$ are, respectively, the weight matrices for the four different gates of feedforward connections in the structure, $U^{(p)}_k$ for $k = f, i, o, c$ are the corresponding weight matrices for the gates of recurrent connections, and $b^{(p)}$ [...] $u = \ldots, T, \ldots$, where $T$ is the number of years in the training/in-sample data and $N$ is the total number of members in the benchmark group.

Figure 2. China and Ukraine vs. benchmark group.

Figure 3. Relative prediction errors for three developing countries (males at the top, females at the bottom).

Figure 4. Relative prediction errors for three developing regions (males at the top, females at the bottom).

Figure A2. Historical (dotted lines) and forecast (blue for ARIMA, red for LSTM) values of $e_{0,t}$ for three target regions.

Figure A3. Historical (dotted lines) and forecast (blue for ARIMA, red for LSTM) values of $e^{\dagger}_{0,t}$ for three target countries.

Figure A7. Forecasting errors by year for Nigeria (females at the top, males at the bottom).

Figure A10. Forecasting errors by year for the less developed regions excluding the least developed countries (LDRexLDC) (females at the top, males at the bottom).

Table 1. Six-year (average) prediction errors for China.
* indicates the smallest value of MSE/MAE/RMSE.

Table 2. Six-year (average) prediction errors for Brazil.
* indicates the smallest value of MSE/MAE/RMSE.

Table 3. Six-year (average) prediction errors for Nigeria.
* indicates the smallest value of MSE/MAE/RMSE.