Prediction of China’s Population Mortality under Limited Data

Population mortality is an important step in quantifying the risk of longevity. China lacks data on population mortality, especially the elderly population. Therefore, this paper first uses spline fitting to supplement the missing data and then uses dynamic models to predict the species mortality of the Chinese population, including age extrapolation and trend extrapolation. Firstly, for age extrapolation, kannisto is used to expand the data of the high-age population. Secondly, the Lee-Carter single-factor model is used to predict gender and age mortality. This paper fills and smoothes the deficiencies of the original data to make up for the deficiencies of our population mortality data and improve the prediction accuracy of population mortality and life expectancy, while analyzing the impact of mortality improvement and providing a theoretical basis for policies to deal with the risk of longevity.


Introduction
At present, China's population aging has entered a stage of rapid development. At the same time, with the rapid advancement of aging, actively coping with population aging has become a national strategy [1]. In 2021, the results of the seventh national census showed that in 2020, the number of people over 65 years old in China reached 190.64 million, accounting for 13.5% of the total population, belonging to the "aging society". The World Health Organization defines societies with 7%, 14%, and 20% of the population over 65 as "aging society", "aging society" and "super-aging society", respectively. According to the research data released by scholars from the global change data Lab (GCDL) in the UK, more than 20% of China's population over the age of 65 will enter the "super-aging society" around 2035 [2].
In China's current economic situation, the demographic dividend should be promoted to a quality-based upgrade, human capital potential should be tapped [3]. To further develop the demographic dividend, we must focus on population aging and on research on newborn life expectancy. In essence, the causes of China's population aging include two aspects: the reduction of birth rate and the extension of average life expectancy. The reduction of birth rate leads to the decline of the proportion of the young and middle-aged population, the extension of average life expectancy, and the decline of mortality, resulting in the rapid growth of the aging population, and finally leading to the aging problem of China's population [4]. Life expectancy is an important factor affecting population aging. According to the information released by the China Health Commission, from reform and opening up to 2019, the average life expectancy of China's population increased from 67.9 years to 77.3 years. With the continuous extension of life expectancy, both economic subjects, life insurance companies, and individual families are suffering from the impact of longevity risk to varying degrees. Mortality prediction is not only the basis of life expectancy prediction but also an important standard to measure the risk of longevity.
Population mortality projections usually include two methods: one is to calculate population life expectancy by projecting sub-age mortality, the other is to calculate sub-age mortality from life tables by projecting population life expectancy [5].The mortality prediction model is mainly developed in the early static model (Makeham model, Gompertz model, H-P model) that describes the change of mortality with age to the classical dynamic model including time and age, such as the Lee-Carter model [6] and CBD [7] model. These are the most basic and classical of mortality prediction models. Later, with the progress of the economy and technology and the constant change of the evolution law of mortality, researchers continuously studied the prediction model of mortality according to the characteristics of actual data. Improvements are mainly made in two aspects: parameter estimation as well as model form. In terms of parameter estimation, Wilmoth [8] proposed weighted minimum and multiplication to solve the parameters of the Lee-Carter model; Lee and Miller [9] put forward three methods to improve the prediction of the Lee-Carter model: selecting fitting period, adjusting period factor with life expectancy, and eliminating "jump error"; Brouhns [10] used Poisson regression to estimate the parameters; Li [11] proposed Lee-Carter model and parameter estimation method under limited data. In terms of the model form, Booth [12] pointed out that the SVD method used in the Lee-Carter model can only extract the first term of a large number of age or time factors, and can expand the model to a complex form with multiple items, and use Poisson regression to estimate parameters. Renshaw and Haberman [13] used the extended form of the two-item Lee-Carter model to predict the mortality rate of British men. Spreeuw presents a mortality model where future stochastic changes in population-wide mortality are driven by a finite-state hierarchical Markov chain, it permits an exact computation of mortality indices [14]. Poisson regression and ARIMA were used to estimate and predict the parameters. Baran [15] used the trinomial Lee-Carter model to predict the mortality rate in Hungary.
Under the background of limited data in China, domestic scholars have also conducted a lot of research on mortality prediction models. There are two main adjustments to mortality prediction models. One is to adjust the parameter estimation methods. The common methods are singular value decomposition, least-square estimation, maximum likelihood estimation, and so on [16][17][18]. Li [19] extended the Lee-Carter model with the MCMC method under limited data conditions and estimated and predicted mortality in the Chinese Mainland with the help of mortality information in Taiwan and Japan. Wang and Zhao [20] defined the improvement factor of mortality under limited data and analyzed the random fluctuation of mortality of the elderly population by the Montecarlo method. Hu [21] used the MCMC method to estimate the parameters of the Lee-Carter model and believed that the Bayesian method effectively reduced the impact of data quality. Wu, Luo, Su, and Gao [22] readjusted the estimated parameters of the mortality model and further analyzed the longevity risk. The second is to focus on the formal expansion of the model, such as the number of expansion factors, the effect of joining the queue, and the expansion of the multi-population model. Hu [23] investigated and compared the prediction effect of the two-factor Lee Carter model The test results show that the goodness of fit and dispersion information criterion DIC of the two-factor model is significantly better than the single-factor model, and better grasps the volatility of mortality over time. Huang and Wang [24] used the Lee-Carter model with cohort effect to fit and predict male mortality. Zhao and Wang [25] used the multi-population Lee-Carter model to derive the iterative weighted least square method based on constraints to predict the mortality rate of the Chinese population. Wang, Zhao, and Chen [26] used multi-population joint modeling to predict the mortality of China's high-age population. For the prediction of the elderly population with missing data, the most commonly used adjustment models in China are the Gomportz model, Weibull model, Logistic model, and Kannisto model. Zeng [27] by comparing various models to fit the elderly population in the four census data, it was found that Kannisto had the best effect. On this basis, some scholars [28,29] later used Kannisto model to predict the mortality of the elderly population by sex in China. There are historical data on China's population mortality, with large fluctuations and poor data quality. There is a deviation between the observed mortality of a specific mantissa age and the actual mortality [30]. In view of this, this paper uses the Lee-Carter model to fit China's mortality by age from 1997 to 2019 and uses Kannisto model to extend the age to 100 years, a more accurate mortality rate is obtained by fitting the data and predicting the parameters to obtain the life expectancy.
We investigated the possibility that the Lee-Carter model could be applied to missing data in China and proposed a single-factor model to explain all mortality dynamics. The current approach was not surpassed by more complex factor models, so we used the single-factor Lee-Carter model as the baseline model [31,32].
Combining its current problems with Chinese population data: In the Chinese census, which combines the population census, the 1% population sample survey and the annual 1% population change sample survey, death data are lacking compared to other aspects of data. Moreover, it is not highly standardized and faces the lack of time series data, and the lack of data for single age group population is becoming more and more prominent, and the lack of data is more obvious in the data for infant and elderly population. Thus, when using the Lee-Carter model, special attention should be paid to the treatment of missing data [33].
The improvements in this paper are as follows:The first is that the parameter estimation cannot be performed using matrix singular value decomposition because the individual age-specific mortality rates are missing or zero and cannot be logarithmic. In order to improve the prediction accuracy, this paper uses spline fitting to supplement the outliers and missing values and then uses maximum likelihood estimation method to estimate the parameter. The quality of Chinese senior population data is poor, and this paper uses the Kannisto method to extend the senior population data to improve the prediction accuracy.
The paper is structured as follows. Material and methods are described in Section 2. In Section 2, the first part gives the benchmark models of Lee-Carter and the indicators of mortality used in this analysis. The second and third part gives the estimation method of parameters. Section 3 gives estimated results of parameters and predicted mortality. Section 4 gives the impact of mortality improvements, making separate projections of life expectancy for both sexes and further analyzing the risk of longevity associated with increased longevity.

Materials and Methods
In this paper, the prediction of mortality is mainly divided into the following steps: Step 1: • Model classical algorithm Establish a logarithmic center mortality model and estimate the model parameters through past data. The basic formula of the model is: We let κ t be the central rate of death at age x and in yeart. where α x and β x are age-specific parameters, x is a time-varying index, and ε x,t is a random item with zero means and finite variances,the variances is σ.
Since there are numerous groups that meet the requirements of the equation, Lee and Carter have standardized the parameters to estimate the above parameters, we let where t is the total number of calendar years included in the mortality data to be estimated. α x is the mean of logarithmic center mortality for age. κ t indicates the intensity of mortality at year. Derived from both sides of Equation (1), · ∂t ∂κ t . That is, β x represents the sensitivity of logarithmic mortality change when the mortality intensity changes. • Maximum likelihood estimation is used to estimate the parameters in Lee-Carter model. The method assumes that the number of deaths d x,t follows the Poisson distribution with parameter λx, t, that is: In Equation (3), D x,t is the annual mortality of the population aged x in year t, and E x,t is the death risk exposure of the population in the year of death. In practical application, it is often replaced by the number of people in this year N x,t . By maximizing the following likelihood function, we can obtain an estimate of the sum.
where C is a constant, the estimated values of parameters al pha x , beta x and kappa t are given by Newton iterative method.
where θ (v) is the parameter of the v-th iteration, and is the corresponding likelihood function. Therefore, the iteration formula of the sum of parame- whereD (v) is the estimated value of the v-th death toll under the assumption of Possion distribution, where w x,t is the weight, w x,t = 1, Data missing, 0, Data available. .
After the above iterative process, D D x,t ,D xtt is small enough, Likelihood formula approaches maximum, Step 2: predict κ t .

•
In the second stage, κ t is modeled and fitted according to the time series method. The model is very simple [34,35]. It only needs endogenous variables without the help of other exogenous variables. In most studies, ARIMA (p, d, q) process is used to model κ t series. ARIMA's prediction model can be expressed as: where φ represents the coefficient of AR , θ represents the coefficient of Ma. In this way, the time series method can be used to simulate the time effectκ t , and the value of the predicted yearκ t can be obtained by trend extrapolation. On this basis, the future mortality m x , t, is predicted, that is ln(m x,t ) =α x +β xκt .

Data Source and Processing
The data set used in this paper is extracted from the China demographic and Employment Statistical Yearbook and the China demographic statistical yearbook from 1998 to 2020, including the death toll and random inspection data from 1997 to 2019. In the data set used in this paper, 2000 and 2010 are census data, 2005 and 2015 are 1% sampling data, and other years are population change sampling. According to the characteristics of the selected data and the need of empirical analysis, the specific data processing and assumptions are as follows: 1.
In order to keep the number of deaths and the overall population in each year roughly in order of magnitude, divide the census data by 1000, and divide the 1% spot check data by 10. According to the mortality data of different ages, the death toll of each age is adjusted based on 1 million; 2.
It is assumed that both 1% population sampling and variable sampling methods have good random sampling characteristics; 3.
The data for most years are from 0 to 89 years old, and the data for 2000, 2005, 2010, and 2015 are from 0 to 99 years old. This paper uses kannisto method to fit the missing data of 90-100 years old in the data; 4.
When the death toll data is zero or missing, the cubic spline interpolation method for the year is used to fill in. Figure 1 depicts the full age mortality rate for both sexes in 1997-2019 and the missing data is interpolated by the above method. From Figure 1, we can see the changes in mortality in recent years. The mortality of young children and the elderly fluctuates greatly, which interferes with the establishment of the model.  • Forecast value of κ t . κ t is a time series model, the unit root AR(1) model is usually employed in the studying of longevity risk, which is a random walk process with drift term. But in this paper, by fitting the ARIMA model, using AIC criteria and t-test statistics, the optimal ARIMA model is selected. Both men and women are ARIMA (1, 2, 1). As discussed before, the parameter κ t is used for the forecast; the blue lines of κ t in Figure 4. Table 1 are the predicted results of κ t value.

Age Factor Expansion
There is a serious lack of data on the elderly population in China. Even the census data are underreported and concealed. The lack of data leads to large fluctuations in the curvature of the tail. In China's Demographic Yearbook and China's demographic and Employment Statistical Yearbook, there is a lack of population data over 90 years old. Therefore, this paper uses kannisto model to correct the data on the elderly population. By extending the age factor of the model, the limitation of the highest age mortality rate in China's population data can be lifted, making the results more realistic.
The form of kannisto model is as follows: where u x is the death power, α and β are parameters to be estimated. In this paper, the mortality data of 60-89 years old are used for fitting, and the parameters α and β to be estimated are calculated to obtain the death power. Through the relationship between the death power and the probability of death and the central mortality, we can obtain the expanded mortality rate.
where q x is the probability of death and the m x is the central morality. Figures 5 and 6 shows the mortality rate of the high-age population after the expansion of the kannisto method.

Mortality Prediction
According to the above methods, the mortality rates by age and sex in the future are shown in Figure 7 and Table 2. (Due to space limitations, this paper shows the mortality data by age and sex in 2025, and 2030, and only some age data are listed).  By comparing the mortality rates in 2019, 2025 and 2030, it can be seen that the mortality rate is continuously improving and the life expectancy is continuously extending. The difference between this and the future actual mortality rate will lead to the occurrence of longevity risk. At the same time, by comparing the changes of male and female mortality, the female mortality fluctuates greatly. In general, the risk of longevity in China is gradually increasing.

Residual Plot Test
By analyzing the residual fit plots, we can visually analyze the model fitting effect and determine whether the model misses important influence information. Ideally, the residuals should be 0. Considering the realistic factors, they should be randomly distributed around 0. The following Figure 8 shows the distribution of residuals.
The residual distribution plot shows that the model fits better, and the distribution of residuals is more concentrated in men than in women, indicating that the fit is better than in women.

MAPE Value Test
In the previous section, we obtained the prediction results for mortality, and to evaluate the model, we used the mean absolute percentage error(MAPE) to evaluate the model fit results.
where y i is the true value,ŷ i is the predicted value, and n is the number of periods.
The calculation results are shown in the following table.
Combined with the data in the Table 3, the deviation of the period forecast results is small and acceptable.

Conclusion Impact on Life Expectancy
The average life expectancy of a cohort is calculated from the level of mortality in each year, using the level of mortality for each age of the population in the same year instead of the level of mortality for the same generation at different ages. Before using the mortality rate to find the average life expectancy, the mortality rate m x,t is converted into the probability of death q x,t , and then the probability of death is used to calculate the average life expectancy.
Then life expectancy: The part of life expectancy projection results are shown in the following Table 4. The short-term life expectancy prediction results given above using the Lee-Carter model, which has robust prediction results in the short term, but may not be robust in the long term with its narrower intervals.By comparing with the average life expectancy of Chinese population published by WHO, the error is small and the fitting result is good. However, we found that the average male life expectancy projection is higher and the average female life expectancy projection is lower. Moreover, the larger difference of female projections may be due to the lower quality of female population mortality data compared to male population mortality data. Therefore, in the subsequent projections, attention should be paid to the correction of the female crude mortality population. By comparing with the average life expectancy of Chinese population published by WHO, the error is small and the fitting result is good.

Discussion
Population health is a hot issue in today's society, and it is also an issue that China should always pay close attention to In the process of China's aging population, given the actual situation of China's population, we should actively face a series of problems brought by the aging population, and deal with the challenges it brings. We must carry out basic research on China's population situation. Mortality is an important indicator to measure the overall health status of a country's population, which can directly reflect the quality of life of the population and is also an important basis for judging the social-economic, scientific, and cultural levels. The prediction of mortality can monitor the expected problems of China's population, which is of great significance for responding to and implementing the strategy of actively dealing with aging.
Based on the data of the 1997-2019 national census, the 1% population sampling survey, and the annual 1‰ population change sampling survey, this study uses the Lee-Carter mortality prediction model to predict the mortality of both sexes in China. Through the study, the following conclusions can be drawn.
With the continuous development of society and the continuous improvement of the level of science and technology and medical treatment, on the whole, the death rate in all regions of China has improved significantly from 1997 to 2019, and at the same time, the life expectancy is also increasing. In terms of age, the mortality rate of each age has decreased, but the improvement of each age is very different. Among them, the mortality rate of the 0-year-old group and the elderly population has decreased the fastest. Changes in mortality are closely related to age. This paper does not group the age groups, and directly uses the data of each age from 0 to 89 years old. This method can obtain the information of the original data to a greater extent, but it also makes the mortality data more volatile and not smooth enough to a certain extent. Therefore, this paper fills and fits the missing data with spline interpolation, which can make the original data smoother. Second, this paper uses Kannisto to extend the data of the senior population, which makes the prediction of life expectancy more accurate. However, the prediction process in this paper also has some degree of drawback: the period of having higher quality population data in China is short, and using ARIMA model for Kt value prediction increases the prediction bias to some extent. More scenarios should be considered when predicting K values. Finally, from 2020 to 2030, the population mortality rate, whether male or female, will decline significantly, and the decline of age-specific mortality rate will affect people's life expectancy, which will increase the pressure of social aging to a certain extent, bring public financial burden to the government, but also bring risks to insurance companies and individuals. With the improvement in mortality and the increase in life expectancy, the government needs to increase pensions. Therefore, the government should improve the operation of pension insurance funds and balance revenue and expenditure. For insurance companies, when formulating the rates of different insurance products, based on referring to the life table, they should also take full account of the risks brought by the improvement of mortality and reasonably design insurance policies.
In conclusion, through the prediction of mortality, the accurate prediction of mortality and population structure will help the decision-making level to formulate population policies, economic policies, and pension policies in line with the future population characteristics, and weaken the negative impact of population aging.

Conclusions
Population mortality is an important step in quantifying the risk of longevity. There is a lack of data on population mortality in China, especially for the elderly population. In this paper, we first use spline fitting to supplement missing data, and then use Lee-Carter dynamic models to predict population mortality in China, including age extrapolation and trend extrapolation. This paper improves the accuracy of population mortality prediction by filling the gap in the original data, and also predicts the life expectancy of the Chinese population, and the results can provide a theoretical basis for policies to address the risk of longevity. Data Availability Statement: The data set used in this paper is extracted from the China population and Employment Statistical Yearbook and the China population statistical yearbook from 1998 to 2020, including the death toll and random inspection data from 1997 to 2019.

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding the publication of this paper.