Article

Stochastic Period and Cohort Effect State-Space Mortality Models Incorporating Demographic Factors via Probabilistic Robust Principal Components

by Dorota Toczydlowska 1,*, Gareth W. Peters 1,2,3, Man Chung Fung 3 and Pavel V. Shevchenko 4

1 Department of Statistical Science, University College London, 1-19 Torrington Place, London WC1E 7HB, UK
2 Man Institute of Quantitative Finance, University of Oxford, Oxford OX1 3BD, UK
3 CSIRO, Canberra, ACT 2601, Australia
4 Department of Applied Finance and Actuarial Studies, Macquarie University, Sydney, NSW 2109, Australia
* Author to whom correspondence should be addressed.
Risks 2017, 5(3), 42; https://doi.org/10.3390/risks5030042
Submission received: 7 February 2017 / Revised: 31 May 2017 / Accepted: 17 July 2017 / Published: 27 July 2017
(This article belongs to the Special Issue Ageing Population Risks)

Abstract

In this study we develop a multi-factor extension of the family of Lee-Carter stochastic mortality models. We build upon the time, period and cohort stochastic model structure and extend it to include exogenous observable demographic features that can be used as additional factors to improve model fit and forecasting accuracy. We develop a dimension reduction feature extraction framework which (a) employs projection-based techniques of dimensionality reduction; in doing this we also develop (b) a robust feature extraction framework that is amenable to different structures of demographic data; (c) we analyse demographic data sets with respect to their patterns of missingness and the impact of such missingness on the feature extraction; (d) we introduce a class of multi-factor stochastic mortality models incorporating time, period, cohort and demographic features, which we develop within a Bayesian state-space estimation framework; and finally (e) we develop an efficient combined Markov chain and filtering framework for sampling the posterior and forecasting. We undertake a detailed case study on the Human Mortality Database demographic data from European countries, and we use the extracted features to better explain the term structure of mortality in the UK over time for male and female populations when compared to a pure Lee-Carter stochastic mortality model. The case study demonstrates that our feature extraction framework and the consequent multi-factor mortality model improve both in-sample fit and, importantly, out-of-sample mortality forecasts by a non-trivial gain in performance.

1. Introduction

Modelling the “term-structure” of age specific mortality rates by gender and country has enjoyed a growing resurgence in the actuarial and statistics literature. This is primarily driven by the importance of better understanding and forecasting age specific mortality rates for purposes of understanding longevity risk, pension design, annuities pricing and population studies.
The most widely utilised stochastic mortality models in actuarial science and statistics arise from the class of regression or state-space models that incorporate stylised latent stochastic factors representing structural features in the evolution of the age specific mortality rates. Typically these latent stochastic features are interpreted as temporal effects, period effects or cohort effects. The most famous class of such models is the Lee-Carter type models; see the recent summary in Fung et al. (2017) and references therein.
In this paper we aim to combine these classes of stochastic mortality model with other observable exogenous features obtained from a range of demographic data sets. Such features offer two advantages over standard Lee-Carter models: firstly, they may improve the predictive power of the models; secondly, they may improve the interpretation of the dynamics of the “term-structure” of age specific mortality rates.
We expect the mortality experience and demographic data to be characterised by a strong causal and time-varying interaction. There is an existing literature on the incorporation of demographic data in stochastic mortality models. However, unlike the state-space mortality age-term structure dynamic factor model approach we develop in this manuscript, the existing works have primarily focused on regression type structures that consider single age group models. Furthermore, there is limited work on feature extraction methods in this space. We highlight a few related approaches that have considered demographic data to study single age group mortality, and comment on some of the widely used exogenous factors in such studies, which include macroeconomic as well as demographic variables. In Hanewald (2011) and Niu and Melenberg (2014), the authors investigate the links between economic growth and mortality trends through a class of single age group regression models which are estimated in a frequentist framework. In addition to the period effect in the standard Lee-Carter model, the authors incorporate gross domestic product (GDP) as an observable factor, which improves the in-sample and out-of-sample performance of the model.
Other classes of factors that have been explored in such settings include cause-of-death categorical variables, which have also been partly investigated in Hanewald (2011). The relation between causes of death and their influence on mortality has been explored in more detail since the accessibility of the data improved. In Murray and Lopez (1997), the authors develop scenarios of future mortality based on a multi-factor linear regression model in which the logarithm of the mortality rate per age group, sex and clustered cause of death is regressed against socio-economic, educational, technological and cause-of-death related predictors. Bayesian inference has been adopted in Girosi and King (2008) to build a regression framework for forecasting mortality rates which are age, sex, country and cause of death specific. That work is mostly focused on the methodological side of forecasting, but uses demographic and macro-epidemiological data as example explanatory variables for the regression-type model of mortality. Moreover, the dependency structures between cause-specific death rates are studied in Gaille and Sherris (2015), where the authors use Vector Error Correction Models to examine such causal relations within countries.
The usage of principal components of the mortality curves as linear regressors has been examined in (Hyndman and Yasmeen 2012). The authors explored the common features of the data applying the functional version of Principal Component Analysis. The concept is further developed in (Erbas et al. 2010), where the cause-of-death-specific smoothed mortality curves are treated as functional data. The obtained principal components serve as basis functions in functional data analysis.
In the following study, we aim to broaden these concepts and investigate the impact of global mortality trends given by various sets of international demographic data, and their potential influence on the mortality experience in one country, in our case study the United Kingdom. To achieve this in a manner suitable for incorporation in multi-age stochastic mortality models we need to perform a parsimonious feature extraction method in order to reduce the large dimensional sets of data to a form suitable for inclusion in such a mortality model. Therefore, we introduce a methodology which is not exclusive to one type of demographic data and is capable of handling the analysis jointly over many different exogenous variables.
To achieve this we must undertake several tasks. The first is to explain a canonical and principled approach to combining such demographic time series data with the stochastic mortality models, for which there are a number of structural approaches that we develop and present.
The second aspect is that large demographic data sets are now available, but a naive incorporation of such features into a stochastic mortality model would result in far too many parameters to estimate; the models would be overfitted and would not generalise well in out-of-sample forecasting. Therefore, we introduce a class of probabilistic, statistically robust feature extraction approaches to reduce dimensionality and capture the core information present in the demographic data, which can then be included more parsimoniously in the stochastic mortality models. The standard concept of robust Principal Component Analysis by means of M- and S-estimators cannot be easily utilised, since the demographic data series are not of equal length and contain missing values across different age groups. Hence, we adopt a probabilistic formulation of Principal Component Analysis which additionally allows us to model the hidden process of missing values.
Another challenge is the issue of parameter uncertainty in mortality modelling, which we address by adopting a Bayesian inference framework based on efficient Markov chain Monte Carlo as in Fung et al. (2017). The estimation of the model is achieved via a Rao-Blackwellised Gibbs sampler. We sample the static parameters via conjugate Gibbs sampling steps, which are followed by a Forward-Backward filtering sampler for the state variables, in order to perform inference from the resulting posteriors.
The contributions of each part of the paper are as follows. Firstly, we briefly overview the concept of mortality modelling, discuss the state-space formulation of the Lee-Carter model with cohort effects and impose identification constraints. Section 3 provides several illustrations of how to incorporate observable factors into Lee-Carter models. The discussion is followed by an introduction to feature extraction by means of Principal Component Analysis. Section 5 extends the standard Principal Component Analysis terminology to the probabilistic setting and derives the steps of its estimation via the Expectation-Maximisation algorithm in order to combine Principal Components with missingness. An overview of the data is given in Section 7, whereas numerical illustrations of empirical studies are presented in Section 8. Finally, Section 11 concludes.

2. Period and Cohort Effect Stochastic Mortality Models: State-Space Formulation

We begin this section by briefly recalling the classical two factor period effect and cohort effect models that have been proposed in Renshaw and Haberman (2006); Pedroza (2006) and Kogure and Kurachi (2010). This includes in particular the state-space formulation of such models which was developed in Fung et al. (2016) and Fung et al. (2017).
The extension of the Lee-Carter model to cohort features proposed in Renshaw and Haberman (2006) introduces a stochastic cohort factor, denoted by $\gamma_{t-x}$, which is incorporated alongside the stochastic period effect $\kappa_t$ of the one-factor Lee-Carter model (Lee and Carter 1992) to produce a two-factor stochastic model. This second cohort factor, like the period effect factor, can also have an age-modulating coefficient, denoted by $\beta_x^{\gamma}$. In this work we adopt the recommendations discussed in Cairns et al. (2009), Haberman and Renshaw (2011) and Hunt and Villegas (2015), where it is proposed to simplify this feature to a constant age-modulating coefficient across all age groups, giving:

$$\log m_{x,t} = \alpha_x + \beta_x \kappa_t + \gamma_{t-x}, \tag{1}$$

where $x \in \{x_1, \dots, x_p\}$ and $t \in \{1, \dots, T\}$ represent age and year respectively, $m_{x,t}$ denotes the mortality rate in age group $x$ at time $t$, and $\alpha_x$ and $\beta_x$ are the age specific static parameters of the model. The simplifying assumption that $\beta_x^{\gamma} = 1$ (or generally any constant other than one) is intended to improve estimation performance when fitting the model in practice. Furthermore, it has been argued in the literature that it may also be justified based on empirical findings. By studying the mortality experience of England and Wales males, Willets (2004) finds that the cohort effect is not “wearing off” with increasing age and that the mortality improvement rates, defined as $1 - m_{x,t}/m_{x,t-1}$, of different cohorts seem to be rather stable. Together with the consideration of the convergence problem, one may argue that it is indeed reasonable to assume that $\beta_x^{\gamma} = 1$ to ensure estimation can be successfully performed for a range of mortality data, while the explanatory power of the simplified model is comparable to that of the full model.
Next we recall the two-factor state-space formulation of the Lee-Carter type period-cohort models for stochastic mortality, see derivations and properties in Fung et al. (2016) and Fung et al. (2017). Note, we adopt the same standard notation as proposed in these papers to present the models in this manuscript.
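The formulation of stochastic period-cohort models in state-space form is given by specification of both an observation equation and a state equation. Let $y_{x,t} = \log \hat{m}_{x,t}$ where $x = x_1, \dots, x_p$ and $t = 1, \dots, T$. The general form of the observation equation (when $\beta_x^{\gamma}$ is flexible) of the cohort model Equation (1) is given in matrix form by (recall that $\gamma_t^{x} := \gamma_{t-x}$):

$$
\begin{pmatrix} y_{x_1,t} \\ y_{x_2,t} \\ \vdots \\ y_{x_p,t} \end{pmatrix}
=
\begin{pmatrix} \alpha_{x_1} \\ \alpha_{x_2} \\ \vdots \\ \alpha_{x_p} \end{pmatrix}
+
\begin{pmatrix}
\beta_{x_1} & \beta_{x_1}^{\gamma} & 0 & \cdots & 0 \\
\beta_{x_2} & 0 & \beta_{x_2}^{\gamma} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\beta_{x_p} & 0 & 0 & \cdots & \beta_{x_p}^{\gamma}
\end{pmatrix}
\begin{pmatrix} \kappa_t \\ \gamma_t^{x_1} \\ \gamma_t^{x_2} \\ \vdots \\ \gamma_t^{x_p} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{x_1,t} \\ \varepsilon_{x_2,t} \\ \vdots \\ \varepsilon_{x_p,t} \end{pmatrix},
\tag{2}
$$

where iid noise terms $\varepsilon_{x,t}$ are included as we aim to model the crude death rates. In a given year $t$, we identify in this model the state vector as $(\kappa_t, \gamma_t^{x_1}, \dots, \gamma_t^{x_p})^{T}$, which represents the $(p+1)$-dimensional latent unobserved stochastic factor driving the observed log-mortality rates. The dynamics of this stochastic state vector are then specified in the state equation. In this work, we consider the state-space model given in matrix form as follows:

$$
\begin{pmatrix} \kappa_t \\ \gamma_t^{x_1} \\ \gamma_t^{x_2} \\ \vdots \\ \gamma_t^{x_{p-1}} \\ \gamma_t^{x_p} \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
0 & \lambda & 0 & \cdots & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} \kappa_{t-1} \\ \gamma_{t-1}^{x_1} \\ \gamma_{t-1}^{x_2} \\ \vdots \\ \gamma_{t-1}^{x_{p-1}} \\ \gamma_{t-1}^{x_p} \end{pmatrix}
+
\begin{pmatrix} \theta \\ \eta \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}
+
\begin{pmatrix} \omega_t^{\kappa} \\ \omega_t^{\gamma} \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}.
\tag{3}
$$

In this particular instance, we assume $\kappa_t$ is a random walk with drift process (ARIMA(0,1,0) with a constant) and the dynamics of $\gamma_t^{x_1}$ are described by a stationary AR(1) process with drift (ARIMA(1,0,0) with a constant) where $|\lambda| < 1$. One may consider other dynamics for $\gamma_t^{x_1}$ by specifying the second row of the $(p+1) \times (p+1)$ matrix in Equation (3). For example, an ARIMA(2,0,0) process for $\gamma_t^{x_1}$ can be assumed if one fixes

$$
\begin{pmatrix} \kappa_t \\ \gamma_t^{x_1} \\ \gamma_t^{x_2} \\ \vdots \\ \gamma_t^{x_{p-1}} \\ \gamma_t^{x_p} \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
0 & \lambda_1 & \lambda_2 & \cdots & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} \kappa_{t-1} \\ \gamma_{t-1}^{x_1} \\ \gamma_{t-1}^{x_2} \\ \vdots \\ \gamma_{t-1}^{x_{p-1}} \\ \gamma_{t-1}^{x_p} \end{pmatrix}
+
\begin{pmatrix} \theta \\ \eta \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}
+
\begin{pmatrix} \omega_t^{\kappa} \\ \omega_t^{\gamma} \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}.
\tag{4}
$$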
The matrix form of Equations (2) and (3) can be expressed succinctly as

$$y_t = \alpha + B \varphi_t + \varepsilon_t, \qquad \varepsilon_t \overset{iid}{\sim} N(0, \sigma_{\varepsilon}^2 I_p), \tag{5a}$$

$$\varphi_t = \Lambda \varphi_{t-1} + \Theta + \omega_t, \qquad \omega_t \overset{iid}{\sim} N(0, \Upsilon), \tag{5b}$$

where $\varphi_t = (\kappa_t, \gamma_t^{x_1}, \dots, \gamma_t^{x_p})^{T}$, $I_p$ is the $p$-dimensional identity matrix and $\Upsilon$ is a $(p+1) \times (p+1)$ diagonal matrix with diagonal $(\sigma_{\kappa}^2, \sigma_{\gamma}^2, 0, \dots, 0)$. The matrices $\alpha$, $B$, $\Lambda$ and $\Theta$ can be easily identified in Equations (2) and (3). For simplicity we assume homoscedasticity in the observation equation; heteroscedasticity can be incorporated straightforwardly as developed in Fung et al. (2016).
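The state-space structure above can be made concrete with a small simulation. The following is a minimal sketch, assuming the simplified case $\beta_x^{\gamma} = 1$ and the AR(1) cohort dynamics of the state equation; the function names `build_system` and `simulate` and all numerical values are illustrative assumptions, not part of the original study.

```python
import numpy as np

def build_system(beta, lam, theta, eta):
    """Build (B, Lam, Theta) for the period-cohort model, assuming beta_x^gamma = 1."""
    beta = np.asarray(beta, dtype=float)
    p = beta.size
    # Observation matrix B = [beta | I_p], of shape p x (p + 1).
    B = np.hstack([beta.reshape(p, 1), np.eye(p)])
    # Transition matrix: kappa is a random walk, gamma^{x_1} is AR(1),
    # and each older cohort slot copies the next younger slot from year t - 1.
    Lam = np.zeros((p + 1, p + 1))
    Lam[0, 0] = 1.0
    Lam[1, 1] = lam
    for i in range(2, p + 1):
        Lam[i, i - 1] = 1.0
    # Drift vector (theta, eta, 0, ..., 0).
    Theta = np.zeros(p + 1)
    Theta[0], Theta[1] = theta, eta
    return B, Lam, Theta

def simulate(alpha, beta, lam, theta, eta, sig_k, sig_g, sig_e, T, seed=0):
    """Simulate log-mortality rates y_t and latent states phi_t for t = 1..T."""
    rng = np.random.default_rng(seed)
    B, Lam, Theta = build_system(beta, lam, theta, eta)
    p = B.shape[0]
    phi = np.zeros(p + 1)              # illustrative initial state (assumption)
    y = np.empty((T, p))
    states = np.empty((T, p + 1))
    for t in range(T):
        # Only kappa and gamma^{x_1} receive noise; the other entries are shifted copies.
        omega = np.zeros(p + 1)
        omega[0] = sig_k * rng.standard_normal()
        omega[1] = sig_g * rng.standard_normal()
        phi = Lam @ phi + Theta + omega
        states[t] = phi
        y[t] = alpha + B @ phi + sig_e * rng.standard_normal(p)
    return y, states

y, states = simulate(alpha=np.array([-5.0, -4.5, -4.0]), beta=[0.5, 0.3, 0.2],
                     lam=0.8, theta=-0.1, eta=0.0,
                     sig_k=0.05, sig_g=0.05, sig_e=0.01, T=30)
```

Because the lower rows of the transition matrix only shift entries, the cohort value seen at age $x_i$ in year $t$ equals the value seen at age $x_{i-1}$ in year $t-1$, which is how the state vector encodes $\gamma_{t-x}$.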
We adopt the identification constraints which are based on Hunt and Villegas (2015) and are broadly discussed and examined in Fung et al. (2017). These are given by

$$\sum_{x} \beta_x = 1, \qquad \sum_{x} \beta_x^{\gamma} = 1, \qquad \sum_{t} \kappa_t = 0, \qquad \sum_{c = t_1 - x_p}^{t_N - x_1} \gamma_c = 0. \tag{6}$$

3. Demographic Factor Model Extension to the Period-Cohort Stochastic Mortality State-Space Models

In this section we demonstrate several approaches one may adopt to extend the state-space model formulations presented previously for the Period-Cohort stochastic mortality models to allow for incorporation of additional observable covariate factors. The form of factor model we develop here is generic and in future sections we will develop a framework for factor extraction from demographic data that can be used in the models developed in this section.
We note that there are two fundamental ways to develop a factor time series based regression structure for incorporating demographic data into a stochastic mortality model. We advocate in this paper an approach which is specifically developed to work with data which may be high dimensional and structured, but represented by short time series. This type of data is particularly prevalent in demographic studies. The main concept here is that the feature extraction is performed over the entire available time series of observable demographic data. The resultant extracted features are then added to the stochastic mortality model in a static form, but with dynamic latent state processes for the factor loadings over time. That is, the effect of an incorporated factor is allowed to vary over time through its factor loading. This approach has the advantage of not having to model explicitly the demographic data, which may have a complex structure, and furthermore it only requires forecasting the components of the latent factor loading process. This is often significantly easier to perform, since one may use, for instance, a standard parametric time series model such as a VAR model for their temporal evolution. Such approaches are also utilised in other financial term-structure state-space models, for instance in the context of yield curve modelling, such as the dynamic Nelson-Siegel model of Diebold and Li (2006). However, to the best of our knowledge we are the first to propose this type of factor model framework for incorporating demographic factors into stochastic mortality models.
We assume that we have an available set of factors that can be country, population, gender or age specific features. We wish to incorporate this age-specific and country-specific demographic/population information into the cohort state-space model described by Equations (5a) and (5b).
We will denote by $F_t$ the $p \times k$ factors matrix, where $p$ may represent the number of age groups and $k$ may represent the number of age specific factors. As in any feature based regression factor analysis such as PCA regression (Jolliffe 2002), we treat $F_t$ as extracted via feature extraction methods from an exogenous observable input that is believed to have a potential influence, over time, on the age specific mortality rates under study in the responses. Then each feature vector regressor extracted from the exogenous demographic data is added to the state-space model.
There are numerous structural ways to achieve this in a state-space model. For instance, the factor may either influence all age groups equally by entering the state equation, or it may influence each age specific mortality rate differently by entering the observation equation. Of course, a combination of such approaches is also possible, depending on which demographic data the feature was extracted from and on the context of the model construction.
It is reasonable to assume that the influence of the feature on the log mortality varies over time, so we specify a time dynamic for the regression factor loading. This requires an additional latent variable, a $pk$-dimensional vector $\varrho_t$, which denotes the vector of factor loadings for year $t$. We assume $\varrho_t$ to be modelled by a VAR(1) process given by

$$\varrho_t = \Omega \varrho_{t-1} + \Psi + \omega_t^{\varrho}, \qquad \omega_t^{\varrho} \overset{iid}{\sim} N(0, \sigma_{\varrho}^2 I_{pk}), \tag{7}$$

with a homoscedastic covariance matrix for the error term $\omega_t^{\varrho}$. $\varrho_t$ is a dynamic regression parameter for the factors matrix $F_t$: its element $\varrho_t^{i,m}$ specifies the impact of the entry $[F_t]_{i,m}$, which corresponds to age group $x_i \in \{x_1, \dots, x_p\}$ and component $m \in \{1, \dots, k\}$.
As noted, depending on the interpretability of the desired model, one may incorporate $F_t$ into the observation Equation (5a) (Case 1), or into the latent dynamics of either the calendar year period effect $\kappa_t$ (Case 2) or the cohort factors vector $\gamma_t$ (Case 3) in the state Equation (5b).
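Next we develop the extended model of Equations (5a) and (5b), which incorporates the information $F_t$. The general notation of the model is as follows

$$y_t = \alpha + \tilde{B}_t \tilde{\varphi}_t + \varepsilon_t, \qquad \varepsilon_t \overset{iid}{\sim} N(0, \sigma_{\varepsilon}^2 I_p), \tag{8a}$$

$$\tilde{\varphi}_t = \tilde{\Lambda} \tilde{\varphi}_{t-1} + \tilde{\Theta} + \tilde{\omega}_t, \qquad \tilde{\omega}_t \overset{iid}{\sim} N(0, \tilde{\Upsilon}), \tag{8b}$$

where $\tilde{\varphi}_t = (\varphi_t^{T}, \varrho_t^{T})^{T}$ is a $(p + pk + 1) \times 1$ latent process vector and

$$\tilde{\Theta} = \begin{pmatrix} \Theta_{(p+1) \times 1} \\ \Psi_{pk \times 1} \end{pmatrix}_{(p+pk+1) \times 1}$$

is a vector of drift parameters for the state equations, where $\Psi$ corresponds to the model Equation (7) of $\varrho_t$. We assume independence of the error terms in the latent variables, which gives the following structure of the covariance matrix for the state equation error term $\tilde{\omega}_t$

$$\tilde{\Upsilon} = \begin{pmatrix} \Upsilon_{(p+1) \times (p+1)} & 0 \\ 0 & \sigma_{\varrho}^2 I_{pk} \end{pmatrix}_{(p+pk+1) \times (p+pk+1)}.$$

Let us specify the following two objects, $\tilde{F}_t = \bigoplus_{j=1}^{p} [F_t]_{j,\cdot}$, for $\bigoplus$ being a direct sum operator, and $\tilde{f}_t = \mathrm{vec}(F_t^{T})$, that is

$$
\tilde{F}_t = \begin{pmatrix}
[F_t]_{1,\cdot} & 0 & \cdots & 0 \\
0 & [F_t]_{2,\cdot} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & [F_t]_{p,\cdot}
\end{pmatrix}_{p \times pk}
\quad \text{and} \quad
\tilde{f}_t = \begin{pmatrix} [F_t]_{1,1} \\ [F_t]_{1,2} \\ \vdots \\ [F_t]_{p,k} \end{pmatrix}_{pk \times 1},
$$

where $[F_t]_{j,\cdot}$ and $[F_t]_{j,m}$ represent the $j$th row of the matrix $F_t$ and the element in its $j$th row and $m$th column, respectively. The structures of the other matrices and vectors for the extended model Equations (8a) and (8b) depend on the introduced cases, that is

$$
\underset{p \times (p+pk+1)}{\tilde{B}_t} =
\begin{cases}
\left[\, B_{p \times (p+1)} \mid \tilde{F}_t \,\right] & \text{for Case 1}, \\
\left[\, B_{p \times (p+1)} \mid 0_{p \times pk} \,\right] & \text{otherwise},
\end{cases}
$$

$$
\underset{(p+pk+1) \times (p+pk+1)}{\tilde{\Lambda}} =
\begin{cases}
\begin{pmatrix} \Lambda_{(p+1) \times (p+1)} & 0_{(p+1) \times pk} \\ 0_{pk \times (p+1)} & \Omega_{pk \times pk} \end{pmatrix} & \text{for Case 1}, \\[2ex]
\begin{pmatrix} \Lambda_{(p+1) \times (p+1)} & \begin{matrix} \tilde{f}_t^{T} \\ 0_{p \times pk} \end{matrix} \\ 0_{pk \times (p+1)} & \Omega_{pk \times pk} \end{pmatrix} & \text{for Case 2}, \\[2ex]
\begin{pmatrix} \Lambda_{(p+1) \times (p+1)} & \begin{matrix} 0_{1 \times pk} \\ \tilde{F}_t \end{matrix} \\ 0_{pk \times (p+1)} & \Omega_{pk \times pk} \end{pmatrix} & \text{for Case 3}.
\end{cases}
$$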
We denote the demographic factor model of the period-cohort stochastic mortality state-space models formulated above by DFM-PC. The estimation of this model is achieved through a Bayesian model formulation and a specialised Markov chain Monte Carlo sampling framework, based on a Forward-Backward sampler for the latent state components and block-Gibbs conjugate sampling for the static model parameters. This follows closely the framework developed extensively in Fung et al. (2016) and Fung et al. (2017); therefore, we repeat only the relevant details in Appendix A.
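The block structure of the extended matrices in the three cases can be sketched in code. This is an illustrative construction only, assuming $F_t$ is available as a $p \times k$ NumPy array; the helper names `direct_sum_rows` and `extended_matrices` are hypothetical and do not come from the original paper.

```python
import numpy as np

def direct_sum_rows(F):
    """Direct sum of the rows of F: a p x (p*k) block-diagonal matrix."""
    F = np.asarray(F, dtype=float)
    p, k = F.shape
    F_tilde = np.zeros((p, p * k))
    for j in range(p):
        F_tilde[j, j * k:(j + 1) * k] = F[j]
    return F_tilde

def extended_matrices(B, Lam, Omega, F, case):
    """Return (B_ext, Lam_ext) for Case 1, 2 or 3 of the extended model."""
    if case not in (1, 2, 3):
        raise ValueError("case must be 1, 2 or 3")
    F = np.asarray(F, dtype=float)
    p, k = F.shape
    F_tilde = direct_sum_rows(F)
    f_tilde = F.reshape(-1)                   # vec(F^T): F_11, F_12, ..., F_pk
    loading_block = np.zeros((p, p * k))      # right block of the observation matrix
    upper_right = np.zeros((p + 1, p * k))    # upper-right block of the transition matrix
    if case == 1:
        loading_block = F_tilde               # factors enter the observation equation
    elif case == 2:
        upper_right[0] = f_tilde              # factors enter the kappa_t dynamics
    else:
        upper_right[1:] = F_tilde             # factors enter the cohort dynamics
    B_ext = np.hstack([B, loading_block])
    Lam_ext = np.block([[Lam, upper_right],
                        [np.zeros((p * k, p + 1)), Omega]])
    return B_ext, Lam_ext

# Toy dimensions: p = 2 age groups, k = 2 factors per age group.
F_t = np.array([[1.0, 2.0], [3.0, 4.0]])
B_ext, Lam_ext = extended_matrices(np.ones((2, 3)), np.eye(3), 0.5 * np.eye(4), F_t, case=2)
```

Only one block changes between the cases, so a sampler written for Equations (8a) and (8b) could in principle be reused across all three by swapping the matrices returned here.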
Remark 1.
The matrix $\tilde{F}_t$ contains the observed (feature extracted) exogenous factors and is a component of the model on which we condition. Therefore it does not need to be estimated and, at this point, represents a known deterministic quantity. As such, $\tilde{F}_t$ is held fixed for the time period over which the factor model is run. Hence, it is not a parameter but a covariate which is observed and deterministic. Therefore, the identification constraints given by Equation (6) remain valid for the new model.
Remark 2.
The time series in the study undertaken are typically of length less than 100. Compared to the number of available samples, the classical Lee-Carter model with cohort effect requires a large number of parameters and has already been reported to be prone to overfitting, as noted in Cairns et al. (2009), Haberman and Renshaw (2011) and Hunt and Villegas (2015). The new model with the latent process introduced in Equation (7) adds further parameters. Therefore, to decrease the risk of overfitting, we choose to stay with the assumption of a VAR(1) process for $\varrho_t$. We believe that, in this study, the parsimony argument for keeping the autoregressive structure without too many additional parameters is more important than adding more lags.

4. Approaches to Demographic Feature Extraction via Robust Probabilistic Principal Components

As we will demonstrate in the application in this paper there is a variety of issues to consider when undertaking feature extraction in demographic population data of relevance to modelling in stochastic mortality models such as those presented previously. In addition, we would like to make sure we achieve a parsimonious model presentation, where we extract features from the demographic data that are the most informative.
For instance, if we have demographic data for $d$ countries, where $p$ denotes the number of different demographic attributes observed, then the $p \times d$ matrix of this data in year $t$ will be denoted by $Y_t$. We assume that $Y_t$ is observed (or partially observed) over periods $t \in \{1, \dots, T\}$. We do not wish to utilise the raw demographic data $Y_t$, as in general it will produce a model with too many parameters; therefore we resort to feature extraction methods based on minimizing some pre-specified projection pursuit index.
We focus on linear methods of dimensionality reduction, more precisely those expressible as linear projections as defined in Friedman and Tukey (1974), which include Principal Component Analysis (PCA), its extensions and robust alternatives. In this paper we have focussed on the use of Principal Component Analysis; hence, we incorporate the basis vectors of the projected lower rank space as the most meaningful factors in terms of variation.
In this regard we consider the column-wise pre-whitened $Y_t$, for which we estimate the sample mean and sample covariance matrix of the demographic time series data $Y_1, \dots, Y_T$; this will be achieved robustly in this paper. This approach then produces a lower rank matrix which is obtained by projection according to

$$X_t = Y_t F$$

for $F$ being the first $k$ selected eigenvectors of the covariance matrix robustly estimated from the sample demographic data $Y_1, \dots, Y_T$. These factors are then entered into the state-space model as presented previously.
In the following subsections we progressively introduce the feature extraction methods that should be considered for demographic data. These have been developed to deal with real data issues, such as missing data and outliers, which may affect the feature extraction process.

4.1. Non-Stochastic Principal Component Analysis

Let us denote by $Y$ the $N \times d$ matrix of the original data set, where a row of the matrix is a single $d$-dimensional observation at a given moment of time. The goal of Principal Component Analysis is to identify the most meaningful unit length basis in which to re-express the data set $Y$. The purpose of the new basis is to better filter out the noise and reveal hidden structure. Therefore, PCA looks for the projection of the observed data

$$Y_{N \times d} W_{d \times d} = X_{N \times d},$$

where $W$ is a $d \times d$ matrix denoting a linear projection. The columns of $W$ are the new orthonormal basis vectors, that is $W^{T} W = I_d$, in which the rows of $X$ are expressed.
Re-expressing $Y$ in a meaningful way means that PCA aims to lower the redundancy in the data set, i.e., it removes the linear dependencies which add noise to the measurements. In mathematical terms, the goal can be written for columns $i \neq j$ of $X$ as

$$X_{\cdot,i}^{T} X_{\cdot,i} = W_{\cdot,i}^{T} C_Y W_{\cdot,i},$$

and

$$X_{\cdot,i}^{T} X_{\cdot,j} = W_{\cdot,i}^{T} C_Y W_{\cdot,j} = 0,$$

where $C_Y = Y^{T} Y$. We seek a linear combination given by Equation (13) that maximizes the overall variance of $X$, $C_X = X^{T} X$. The solution to the problem is found by a maximiser of the following Lagrangian expression

$$Q(W) = W^{T} C_Y W - \Lambda \left( W^{T} W - I_d \right),$$

for $\Lambda_{d \times d}$ being a diagonal $d \times d$ matrix of Lagrange multipliers. The stationary points of this quadratic form are found by setting the partial derivatives to zero

$$\frac{\partial Q}{\partial W} = 2 C_Y W - 2 W \Lambda = 0 \iff C_Y W = W \Lambda.$$

We see that $W$ is a matrix whose columns are eigenvectors of $C_Y$, whereas $\Lambda$ is a matrix of the corresponding eigenvalues, with the number of non-zero elements equal to the rank of $C_Y$. The columns of $X$ are indeed orthogonal since

$$X_{\cdot,i}^{T} X_{\cdot,j} = W_{\cdot,i}^{T} C_Y W_{\cdot,j} = W_{\cdot,i}^{T} \lambda_j W_{\cdot,j} = \lambda_j W_{\cdot,i}^{T} W_{\cdot,j} = 0$$

and correspond to unequal eigenvalues. It is easily proven that $X$, defined by $W$, the eigenvectors of $C_Y$, maximizes the trace of $C_X$ and its determinant, and maximizes the Euclidean distance between the columns of $X$; see Basilevsky (1994). Also, the representation minimizes the mean square error between the observation and its projection, as this problem is equivalent to maximizing the variance of $X$.
We wish to find estimates of $W$ and $X$ which minimise the sums of squares of the residuals $\epsilon = Y - X W^{T}$, both $\epsilon^{T} \epsilon$ and $\epsilon \epsilon^{T}$. Assuming that the residuals have a homogeneous covariance matrix, that is $\epsilon^{T} \epsilon = \sigma^2 I_d$, we have

$$Q(W, X) = \epsilon^{T} \epsilon = \sigma^2 I_d = \left( Y - X W^{T} \right)^{T} \left( Y - X W^{T} \right) = Y^{T} Y + W X^{T} X W^{T} - W X^{T} Y - Y^{T} X W^{T}. \tag{19}$$

Since both $W$ and $X$ are treated as parameters to be estimated, we minimize Equation (19) by computing the partial derivatives of the function $Q$ with respect to them and setting them to zero

$$\frac{\partial Q}{\partial W} = -2 Y^{T} X + 2 W X^{T} X = 0$$

and since $Y W = X$

$$Y Y^{T} X = Y W X^{T} X \implies Y Y^{T} X = X X^{T} X.$$

As we are looking for uncorrelated explanatory variables, for $\Lambda = X^{T} X$ we get

$$Y Y^{T} X = X \Lambda,$$

which shows that $X$ and $\Lambda$ are eigenvectors and eigenvalues of the $N \times N$ matrix $Y Y^{T}$. What is more, differentiating $Q$ with respect to $X$ gives

$$\frac{\partial Q}{\partial X} = -2 Y W + 2 X W^{T} W = 0,$$

which, using similar arguments as above, provides

$$Y^{T} Y W = W \Lambda,$$

showing that $W$ and $\Lambda$ are eigenvectors and eigenvalues of the $d \times d$ covariance matrix of $Y$, $C_Y = Y^{T} Y$.
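The two eigenvalue relations derived above can be checked numerically. The snippet below is a small sketch on synthetic data (the data, dimensions and the function name `pca_eig` are illustrative assumptions); it uses `numpy.linalg.eigh` on $C_Y = Y^{T} Y$ and then verifies that $X = YW$ has orthogonal columns and satisfies $Y Y^{T} X = X \Lambda$.

```python
import numpy as np

def pca_eig(Y):
    """Eigendecomposition-based PCA of a column-centred N x d data matrix Y."""
    C_Y = Y.T @ Y                            # d x d (unnormalised) covariance matrix
    eigvals, W = np.linalg.eigh(C_Y)         # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest variance comes first
    eigvals, W = eigvals[order], W[:, order]
    X = Y @ W                                # principal component scores
    return W, eigvals, X

# Synthetic correlated data, centred column-wise (illustrative only).
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 4))
Y -= Y.mean(axis=0)

W, eigvals, X = pca_eig(Y)
Lam = np.diag(eigvals)

print(np.allclose(W.T @ W, np.eye(4)))               # orthonormal basis
print(np.allclose(X.T @ X, Lam, atol=1e-6))          # uncorrelated scores, C_X = Lambda
print(np.allclose(Y.T @ Y @ W, W @ Lam, atol=1e-6))  # W: eigenvectors of Y^T Y
print(np.allclose(Y @ Y.T @ X, X @ Lam, atol=1e-6))  # X: eigenvectors of Y Y^T
```

Keeping only the first $k$ columns of `W` gives the lower rank projection used for feature extraction.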

4.2. Stochastic Principal Component Analysis

In the following part we consider PCA from the population distribution point of view, i.e., instead of the matrix $Y$, we have a $d$-dimensional random variable $y_t$ which is linearly transformed into an uncorrelated $d$-dimensional random variable $x_t$. At this stage Principal Component Analysis does not require any assumption about the distribution of the random vector $y_t$. The only assumption we make refers to the projection matrix $W$ and demands its orthonormality. If the random vector $y_t$ has a known mean equal to zero and covariance matrix $C_Y$, the model transforms to

$$y_t W = x_t \tag{25}$$

and implies that $x_t$ is a $d$-dimensional multivariate random variable with a diagonal covariance matrix $\Lambda$. If in addition we assume $y_t$ to be normally distributed, the lack of correlation implies independence.
If we consider $N$ realisations of the random variable $y_t$, which are placed in the rows of the $N \times d$ matrix $Y$, we have an algebraic problem as introduced in Section 4.1. The conceptual difference is that in the case of stochastic PCA we work with an estimator of the covariance matrix, e.g., the sample estimator $S_Y = \frac{1}{N} Y^{T} Y$.

4.2.1. Extending Stochastic Principal Component Analysis to Factor Analysis

In this section we no longer assume the underlying process to be perfectly observed as would be the assumption typically made in the stochastic version of PCA above. The implication of this can be interpreted as follows: we no longer assume that the underlying time series of demographic data is perfectly observed with no observation error. Instead there is an observation error present and the covariance matrix used in the PCA (deterministic or stochastic-population estimator based analysis) no longer explains all variation in the response or the time series demographic data. This is practically important to consider in feature extraction in practice. In this section we briefly introduce this relaxation and show its relationship to stochastic PCA above.
To discuss PCA by means of Factor Analysis we need to introduce additional notation and a new variable to our model, namely a $d$-dimensional error term $\epsilon_t$, and rewrite Equation (25) as
$$y_t = x_t W^T + \epsilon_t,$$
where $y_t$, $x_t$ and $\epsilon_t$ are $d$-dimensional random vectors. Given $N$ realisations of the random vector, which are placed in the rows of the matrices $Y$, $X$, $\epsilon$ respectively, the above problem has the following matrix form
$$Y_{N \times d} = X_{N \times d} W_{d \times d}^T + \epsilon_{N \times d}.$$
Factor analysis assumes a diagonal covariance structure of $\epsilon_t$. It differs from the PCA model discussed in the previous subsections in that the components given by $x_t$ and $W$ account for the correlation between elements of $y_t$ but only for part of the variation (in standard PCA, $x_t$ and $W$ account for the entire variance), since
$$E\left[y_t^T y_t\right] = E\left[\left(x_t W^T + \epsilon_t\right)^T \left(x_t W^T + \epsilon_t\right)\right] = W \Lambda W^T + \Psi.$$
If we assume the multivariate distributions $x_t \sim N(0, I_d)$ and $\epsilon_t \sim N(0, \Psi)$, we obtain conditional independence of the components of $y_t$ given the latent variable $x_t$, i.e.,
$$y_t \mid x_t, W, \Psi \sim N\left(x_t W^T, \Psi\right),$$
as $\Psi$ is diagonal. Recall that the variable $x_t$ reproduces all correlations between components of $y_t$. Imposing normality assumptions on $y_t$ and $x_t$ enables ML estimation of $x_t$, $W$ and $\Psi$ with optimality properties.
The marginal distribution of $y_t$ is then calculated by integrating the joint distribution of $y_t$ and $x_t$ (which is given via the chain rule)
$$\pi(y_t, x_t \mid W, \Psi) = \pi(y_t \mid x_t, W, \Psi)\, \pi(x_t \mid W, \Psi) = (2\pi)^{-\frac{d}{2}} |\Psi|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2} \left(y_t - x_t W^T\right) \Psi^{-1} \left(y_t - x_t W^T\right)^T\right\} (2\pi)^{-\frac{d}{2}} \exp\left\{-\frac{1}{2} x_t x_t^T\right\}$$
with respect to the random variable $x_t$, that is
$$\pi(y_t \mid W, \Psi) = \int_{\mathbb{R}^d} \pi(y_t, x_t \mid W, \Psi)\, dx_t = (2\pi)^{-\frac{d}{2}} |C|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2} y_t C^{-1} y_t^T\right\}$$
for $C = W W^T + \Psi$, where $|C|$ denotes the determinant of the matrix. Hence, $y_t \mid W, \Psi \sim N\left(0, W W^T + \Psi\right)$. Notice that since $\Psi$ is diagonal, the correlation structure between the components of $y_t$ is specified by the matrix $W$.
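As an informal numerical check of the marginal covariance $C = W W^T + \Psi$, the following sketch simulates the factor model with row-vector observations and compares the sample covariance with $C$. The particular values of `W`, `psi` and the sample size are arbitrary choices made for illustration only; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 200_000

# Hypothetical loading matrix W (d x d) and diagonal noise variances Psi.
W = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 1.0]])
psi = np.array([0.5, 1.0, 1.5, 2.0])

X = rng.standard_normal((N, d))                   # rows x_t ~ N(0, I_d)
E = rng.standard_normal((N, d)) * np.sqrt(psi)    # rows eps_t ~ N(0, Psi)
Y = X @ W.T + E                                   # rows y_t = x_t W^T + eps_t

S = Y.T @ Y / N                                   # sample covariance (zero mean assumed)
C = W @ W.T + np.diag(psi)                        # model-implied covariance

print(np.abs(S - C).max())                        # small for large N
```

The off-diagonal entries of `S` are driven by `W` alone, which is the sense in which the correlation structure of $y_t$ is specified by $W$ when $\Psi$ is diagonal.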

Link to Principal Component Analysis

If we assume that the error term $\epsilon_t$ is homogeneous, that is $\Psi = \sigma^2 I_d$ for $\sigma^2 > 0$, then the problem of finding $W$ by means of PCA given $C = W W^T + \sigma^2 I_d$ is identifiable (see further discussions in (Tipping and Bishop 1999)).
Having the eigendecomposition of the covariance matrix, $C = U_{d \times d} L_{d \times d} U^T$, for a diagonal matrix $L$ and an orthonormal matrix $U$, we have
$$0 = C U - U L = \left(W W^T + \sigma^2 I_d\right) U - U L = W W^T U - U \left(L - \sigma^2 I_d\right).$$
Thus, $\Lambda = L - \sigma^2 I_d$ and $U$ are the matrices of eigenvalues and corresponding eigenvectors of $W W^T$. Since $\lambda_i = l_i - \sigma^2 \geq 0$, the scalar $\sigma^2$ can be chosen as the smallest diagonal element of $L$. Then the factor loadings are given by $P = U \Lambda^{\frac{1}{2}}$.

PCA as a Limiting Case of Factor Analysis

The assumption of an isotropic error term is crucial in order to establish the link between Factor Analysis and PCA. The standard derivation of PCA does not account for any error term. However, we can view PCA as the limiting case $\sigma^2 \to 0$. Then, as noted in (Roweis 1998), PCA is a limiting case of the linear Gaussian model as the noise covariance matrix becomes infinitesimally small and equal in all directions. The effect is that the likelihood of a point $y_t$ is dominated by the squared residuals between the observation and its projection $x_t W^T$. As $\sigma^2$ tends to zero, the posterior over states $x_t$ collapses to a single point and its covariance becomes zero, since
$$x_t \mid y_t, W, \sigma^2 \sim N\left(y_t W \left(W^T W + \sigma^2 I_d\right)^{-1}, \sigma^2 M^{-1}\right) \xrightarrow{\sigma^2 \to 0} \delta\left(x_t - y_t W \left(W^T W\right)^{-1}\right).$$
The form of the conditional distribution $x_t \mid y_t, W, \sigma^2$ is justified in Section 5.1.
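The collapse of the posterior mean onto the least-squares projection can be illustrated numerically. The sketch below is a minimal illustration under the assumption $M = W^T W + \sigma^2 I$ (the form used in Section 5.1); the function name `posterior_mean` and the random test data are hypothetical and introduced here only for the example.

```python
import numpy as np

def posterior_mean(y, W, sigma2):
    """Posterior mean y W M^{-1} with M = W^T W + sigma2 * I (returned as a 1-D array)."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    # M is symmetric, so M^{-1} W^T y^T is the transpose of y W M^{-1}.
    return np.linalg.solve(M, W.T @ y)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))       # hypothetical d x k loading matrix
y = rng.standard_normal(5)            # one observation

least_squares = np.linalg.lstsq(W, y, rcond=None)[0]
for sigma2 in (1.0, 1e-3, 1e-9):
    gap = np.abs(posterior_mean(y, W, sigma2) - least_squares).max()
    print(sigma2, gap)                # the gap shrinks as sigma2 -> 0
```

For a positive $\sigma^2$ the posterior mean is shrunk towards zero relative to the least-squares coefficients, which is the regularising effect of the noise term.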

4.2.2. Missing Values

Until now, we assumed the data did not contain any missing observations. However, in many demographic time series there are numerous types of missing data. This is therefore an important aspect to address in the feature extraction.
When considering missing values we need to incorporate additional variables which describe the distribution of missing observations. Let $y_t = (y_t^o, y_t^m)$ be a real-valued $d$-dimensional random vector, where $y_t^o$ is the sub-vector of observed entries of $y_t$ and $y_t^m$ is the sub-vector of unobserved, i.e., missing, entries. The indicator random vector $r_t$ records which entries of $y_t$ are missing, marking them by 1 and the observed entries by 0. Recall that a single observation consists of the pair $(y_t^o, r_t)$ with distribution parameters $\Theta$ and $\Theta_r$, respectively. We assume the parameters to be distinct. The likelihood of the parameters is proportional to the conditional probability of $y_t^o, r_t \mid \Theta, \Theta_r$, that is
$$\pi\left(y_t^o, r_t \mid \Theta, \Theta_r\right) = \int \pi\left(y_t^o, y_t^m, r_t \mid \Theta, \Theta_r\right) dy_t^m = \int \pi\left(r_t \mid y_t, \Theta, \Theta_r\right) \pi\left(y_t \mid \Theta, \Theta_r\right) dy_t^m.$$
In our study, we assume the pattern of missing data to be MAR (missing at random) as defined in (Little and Rubin 2002). This assumption requires the indicator variable $r_t$ to be independent of the values of the missing data. The vector $y_t$ which is MAR then satisfies
$$\pi\left(r_t \mid y_t, \Theta_r\right) = \pi\left(r_t \mid y_t^o, \Theta_r\right),$$
which results in
$$\pi\left(y_t^o, r_t \mid \Theta, \Theta_r\right) = \pi\left(r_t \mid y_t^o, \Theta_r\right) \int \pi\left(y_t \mid \Theta\right) dy_t^m = \pi\left(r_t \mid y_t^o, \Theta_r\right) \pi\left(y_t^o \mid \Theta\right).$$
Under the MAR assumption, the estimation of $\Theta$ via maximum likelihood of the joint distribution $y_t^o, r_t \mid \Theta, \Theta_r$ is equivalent to the maximisation of the likelihood of the marginal distribution $y_t^o \mid \Theta$. Hence, we do not need to model the distribution of the indicator random variable $r_t$ or the joint distribution of $y_t^o$ and $r_t$. If the MAR assumption does not hold, one needs to solve the integral from Equation (34) in order to maximize the joint likelihood.
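For a zero-mean Gaussian $y_t$, the marginal likelihood of the observed entries only requires the sub-block of the covariance matrix that corresponds to those entries. The following sketch evaluates this observed-data loglikelihood for a single row, assuming missing entries are coded as `NaN`; the function name `observed_loglik` and the coding convention are assumptions made for illustration.

```python
import numpy as np

def observed_loglik(y, C):
    """Loglikelihood of the observed entries of y under N(0, C); NaN marks missing."""
    o = ~np.isnan(y)                          # mask of observed entries
    y_o = y[o]
    C_oo = C[np.ix_(o, o)]                    # covariance block of observed entries
    _, logdet = np.linalg.slogdet(C_oo)
    quad = y_o @ np.linalg.solve(C_oo, y_o)
    return -0.5 * (o.sum() * np.log(2 * np.pi) + logdet + quad)

C = np.array([[2.0, 0.3],
              [0.3, 1.0]])
print(observed_loglik(np.array([1.0, np.nan]), C))   # uses only the (1,1) block
print(observed_loglik(np.array([1.0, -0.5]), C))     # complete observation
```

Under MAR, summing this quantity over rows gives the objective that is maximised with respect to $\Theta$, without any reference to the missingness mechanism.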

5. Efficient Probabilistic PCA Feature Extraction in the Presence of Missingness via EM Algorithm

The combination of PCA with missingness and the Factor Analysis leads us to the “Probabilistic PCA” which can be estimated via Expectation Maximisation framework as described below. This is an exceptionally efficient and numerically stable approach to apply in practice.
Let us consider a $d \times 1$ vector of observable demographic data from which we wish to extract features, denoted at time $t$ by $y_t$. We seek a $k$-dimensional uncorrelated latent vector $x_t$ which provides the most meaningful model of $y_t$,
$$y_t = x_t W_{d \times k}^T + \epsilon_t.$$
We aim for $W^T W = I_k$ (i.e., orthonormality of the projection matrix); however, this is not imposed in the estimation process. We assume multivariate normal prior distributions for the $k$-dimensional latent variable, $x_t \sim N(0, I_k)$, and for the error term, $\epsilon_t \sim N(0, \sigma^2 I_d)$. Given $N$ realisations of the observable variable $y_t$, the sample model has the form
$$Y_{N \times d} = X_{N \times k} W_{d \times k}^T + \epsilon_{N \times d}$$
where single realisations are placed in the rows of $Y$, $X$ and $\epsilon$, respectively.
Our goal is to estimate the coefficient matrix $W$ and the scalar $\sigma^2$, and to filter realisations of the latent variable $x_t$, employing the Expectation-Maximisation (EM) algorithm. The steps and derivation of the algorithm are described in (Rubin and Thayer 1982), where no missingness is assumed. The authors use the results introduced by (Dempster et al. 1977) for factors treated as missing data. The EM algorithm uses the complete data loglikelihood, i.e., the logarithm of the likelihood of $y_t, x_t \mid W, \sigma^2$ from Equation (30), given by
$$L_{y_t, x_t \mid W, \sigma^2}\left(\sigma^2, W; y_{1:N}, x_{1:N}\right) = \prod_{n=1}^{N} \pi\left(y_n, x_n \mid W, \sigma^2\right)$$
for $y_n = [Y]_{n, \cdot}$, and maximizes the expression in Equation (39) integrated with respect to the unobserved values of $x_t$. The algorithm is summarized by the following two steps
1. Expectation step: Expectation of the loglikelihood function of the joint distribution of $y_t, x_t$ given by Equation (30) with respect to the conditional distribution $x_t \mid y_t, W, \sigma^2$
$$Q\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right) = E_{x_t \mid y_t, W, \sigma^2}\left[\log L_{y_t, x_t \mid W, \sigma^2}\left(\sigma^{*2}, W^*; y_{1:N}, x_{1:N}\right)\right]$$
2. Maximisation step: Finding $W^*$ and $\sigma^{*2}$ that maximize $Q\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right)$
$$\left(W^*, \sigma^{*2}\right) = \underset{W^* \in \mathbb{R}^{d \times k},\ \sigma^{*2} > 0}{\operatorname{argmax}}\ Q\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right)$$
The Expectation step (E-step) provides the expectation in Equation (40) of the complete data likelihood in Equation (39) based on $y_{1:N}$ and assumes $W$ and $\sigma^2$ to be known. It uses the observed data, the current estimates of the parameters and the distribution of the missing values conditioned on these elements. The Maximisation step (M-step) maximizes the expectation in Equation (40) with respect to $W^*$ and $\sigma^{*2}$ as if it were based on complete data information. In (Dempster et al. 1977), the authors prove that the loglikelihood in Equation (39) is non-decreasing on each iteration of the algorithm and provide conditions that ensure its convergence (Theorem 1 and Theorem 2 in (Dempster et al. 1977), respectively).
We derive the steps of the algorithm using the assumptions of normality of $y_t$ and $x_t$ given at the beginning of the section. As mentioned in (Dempster et al. 1977), the convexity of regular exponential families (to which the normal distribution belongs) ensures the uniqueness of the maximizers computed in the M-step. In addition, the normal distribution provides closed forms for the moments used in subsequent steps of the algorithm, which simplifies the computations.

5.1. Expectation Step and Its Maximum

Finding the expectation in Equation (40) requires specifying the conditional distribution of $x_t$ given the observations $y_t$ and the parameters. It is given via Bayes’ rule as
$$\pi\left(x_t \mid y_t, W, \sigma^2\right) = \frac{\pi\left(y_t \mid x_t, W, \sigma^2\right) \pi\left(x_t \mid W, \sigma^2\right)}{\pi\left(y_t \mid W, \sigma^2\right)}$$
and results in $x_t \mid y_t, W, \sigma^2 \sim N\left(y_t W M^{-1}, \sigma^2 M^{-1}\right)$ for $M = W^T W + \sigma^2 I_k$. Given $N$ realisations of $y_t$, the expectation of the loglikelihood with respect to the conditional distribution of $x_t$ is given by the following theorem.
Theorem 1.
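The expectation of the E-step, $E_{x_t \mid y_t, W, \sigma^2}\left[\log L_{y_t, x_t \mid W, \sigma^2}\left(\sigma^{*2}, W^*; y_{1:N}, x_{1:N}\right)\right]$, is given by
$$\begin{aligned} Q\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right) &= \int_{\mathbb{R}^k} \pi\left(x_t \mid y_t, W, \sigma^2\right) \log \prod_{n=1}^{N} \pi\left(y_n, x_n \mid W^*, \sigma^{*2}\right) dx_t \\ &= -\sum_{n=1}^{N} \Big\{ \frac{d}{2} \log \sigma^{*2} + \frac{1}{2} \operatorname{tr}\left(E\left[x_n^T x_n \mid y_n, W, \sigma^2\right]\right) + \frac{1}{2\sigma^{*2}} y_n y_n^T \\ &\qquad - \frac{1}{\sigma^{*2}} E\left[x_n \mid y_n, W, \sigma^2\right] W^{*T} y_n^T + \frac{1}{2\sigma^{*2}} \operatorname{tr}\left(W^{*T} W^* E\left[x_n^T x_n \mid y_n, W, \sigma^2\right]\right) \Big\} \end{aligned}$$
for the corresponding moments of the conditional distribution $x_n \mid y_n, W, \sigma^2$
$$\begin{aligned} E\left[x_n \mid y_n, W, \sigma^2\right]_{1 \times k} &= y_n W M^{-1} \\ E\left[x_n^T x_n \mid y_n, W, \sigma^2\right]_{k \times k} &= \sigma^2 M^{-1} + E\left[x_n \mid y_n, W, \sigma^2\right]^T E\left[x_n \mid y_n, W, \sigma^2\right] \end{aligned}$$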
Proof. 
Please refer to the results of (Rubin and Thayer 1982). ☐
The M-step of the EM algorithm uses the computed expectation and maximizes it with respect to the static parameters $W^*$ and $\sigma^{*2}$. The maximizers are given by the following theorem.
Theorem 2.
The maximizers of $Q\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right)$ are the solution to the set of problems $\frac{\partial Q}{\partial W^*} = 0$ and $\frac{\partial Q}{\partial \sigma^{*2}} = 0$, and are given by
$$\begin{aligned} W^*_{d \times k} &= \left[\sum_{n=1}^{N} y_n^T E\left[x_n \mid y_n, W, \sigma^2\right]\right] \left[\sum_{n=1}^{N} E\left[x_n^T x_n \mid y_n, W, \sigma^2\right]\right]^{-1} \\ \sigma^{*2} &= \frac{1}{N d} \sum_{n=1}^{N} \left\{ y_n y_n^T - 2\, y_n W^* E\left[x_n \mid y_n, W, \sigma^2\right]^T + \operatorname{tr}\left(E\left[x_n^T x_n \mid y_n, W, \sigma^2\right] W^{*T} W^*\right) \right\}. \end{aligned}$$
The iteration over the E-step and M-step provided by Theorems 1 and 2 can be replaced by iterations over one combined step, as noted in (Tipping and Bishop 1999), which for iterations $i$ and $i + 1$ is given by
$$\begin{aligned} 1:&\quad W_{(i+1)} = S W_{(i)} \left(\sigma_{(i)}^2 I_k + M^{-1} W_{(i)}^T S W_{(i)}\right)^{-1} \\ 2:&\quad \sigma_{(i+1)}^2 = \frac{1}{d} \operatorname{tr}\left(S - S W_{(i)} M^{-1} W_{(i+1)}^T\right) \end{aligned}$$
until the convergence of $Q\left(W_{(i)}, \sigma_{(i)}^2 \mid W_{(i+1)}, \sigma_{(i+1)}^2\right)$, where $M = W_{(i)}^T W_{(i)} + \sigma_{(i)}^2 I_k$ and $S = \frac{1}{N} Y^T Y$.
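The combined update can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `ppca_em`, the random initialisation, and the stopping rule based on the change in the parameters (instead of the change in $Q$) are assumptions made for the example, and the data are assumed to be centred.

```python
import numpy as np

def ppca_em(Y, k, n_iter=5000, tol=1e-12, seed=0):
    """Combined EM update for PPCA on centred, complete data Y (N x d)."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    S = Y.T @ Y / N                              # S = (1/N) Y^T Y
    W = rng.standard_normal((d, k))              # arbitrary starting point
    sigma2 = 1.0
    I_k = np.eye(k)
    for _ in range(n_iter):
        M = W.T @ W + sigma2 * I_k
        SW = S @ W
        # W_new = S W (sigma2 I_k + M^{-1} W^T S W)^{-1}
        W_new = SW @ np.linalg.inv(sigma2 * I_k + np.linalg.solve(M, W.T @ SW))
        # sigma2_new = (1/d) tr(S - S W M^{-1} W_new^T)
        sigma2_new = np.trace(S - SW @ np.linalg.solve(M, W_new.T)) / d
        change = max(np.abs(W_new - W).max(), abs(sigma2_new - sigma2))
        W, sigma2 = W_new, sigma2_new
        if change < tol:
            break
    return W, sigma2

# Usage on simulated data with a hypothetical 6 x 2 loading matrix.
rng = np.random.default_rng(42)
W_true = np.array([[3.0, 0.0], [0.0, 2.0], [1.0, 1.0],
                   [1.0, -1.0], [0.5, 0.5], [0.0, 1.0]])
Y = rng.standard_normal((2000, 2)) @ W_true.T + np.sqrt(0.5) * rng.standard_normal((2000, 6))
W_hat, sigma2_hat = ppca_em(Y, k=2)
print(sigma2_hat)
```

Note that `W_hat` is only determined up to a rotation of its columns, so comparisons are best made through `W_hat @ W_hat.T`.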

5.2. The Maximum Likelihood Estimation—The Convergence of EM Algorithm

The work of (Dempster et al. 1977) proves that the EM algorithm for the normal distribution always converges to a local maximum. Recall that the solution provided by the EM algorithm converges to the solution obtained by maximizing the likelihood of the marginal distribution $y_t \mid W, \sigma^2$, whose logarithm is given by
$$L_{y_t \mid W, \sigma^2}\left(W, \sigma^2; y_{1:N}\right) = -\frac{N}{2} \left\{ d \log(2\pi) + \log |C| + \operatorname{tr}\left(C^{-1} S\right) \right\}$$
where $S_{d \times d} = \frac{1}{N} \sum_{n=1}^{N} y_n^T y_n$ and $C = W W^T + \sigma^2 I_d$. The ML estimator of $W$ given by the likelihood in Equation (47) is a solution to the following fixed-point equation
$$\frac{\partial L_{y_t \mid W, \sigma^2}}{\partial W} = N \left(C^{-1} S C^{-1} W - C^{-1} W\right) = 0 \iff W = S C^{-1} W$$
We may distinguish three possible cases for the above solution:
Case 1: $W = 0$
The solution in this case is treated as a minimum of the log-likelihood.
Case 2: $S = C$
The equality implies that the $d - k$ smallest eigenvalues of $S$ are equal to $\sigma^2$, and the problem is not identifiable since $W W^T = S - \sigma^2 I_d$. Given the eigendecomposition of $S$, that is
$$S = U_{d \times d} \Lambda_{d \times d} U_{d \times d}^T$$
for an orthonormal matrix $U$ such that $U^T U = I_d$ and a diagonal matrix $\Lambda$ with non-negative entries, the matrix $W$ is equal to $W = U \left(\Lambda - \sigma^2 I_d\right)^{\frac{1}{2}} R^T$ where $R_{k \times d}$ is an arbitrary rotation matrix. The case is proven in Section 4.2.1.
Case 3: $W \neq 0$ and $S \neq C$
In order to compute the solution to Equation (48) we use the singular value decomposition of $W$, that is
$$W = V_{d \times k} L_{k \times k} R_{k \times k}^T$$
where $V$ and $R$ are real-valued matrices with orthonormal columns, and $L$ is a non-negative diagonal matrix. Using $C = W W^T + \sigma^2 I_d$ we apply the above facts to the problem defined by the fixed-point equation
$$\begin{aligned} W_{d \times k} &= S C^{-1} W \\ V L R^T &= S \left(V L R^T R L V^T + \sigma^2 I_d\right)^{-1} V L R^T \\ V L &= S \left(V L^2 V^T + \sigma^2 I_d\right)^{-1} V L \\ V L &= S V \left(L^2 + \sigma^2 I_k\right)^{-1} L \\ V \left(L^2 + \sigma^2 I_k\right) L &= S V L \end{aligned}$$
where the fourth line uses $\left(V L^2 V^T + \sigma^2 I_d\right) V = V \left(L^2 + \sigma^2 I_k\right)$, which follows from $V^T V = I_k$. Notice that, for $l_j \neq 0$,
$$S v_j = \left(\sigma^2 + l_j^2\right) v_j$$
where $v_j = [V]_{\cdot, j}$ and $l_j = [L]_{j, j}$. Hence, the vectors $v_j$ are eigenvectors of the estimated covariance matrix $S$. Using the eigendecomposition of $S$ given by Equation (49), we see that $v_j$ corresponds to an eigenvector $u_j$ of $S$ with eigenvalue $\lambda_j = l_j^2 + \sigma^2$. Since $L$ has a different dimension than $\Lambda$, we express it as
$$L = \left(K - \sigma^2 I_k\right)^{\frac{1}{2}}$$
where $[K]_{j, j} = \lambda_j$ is the $j$-th eigenvalue of $S$ and corresponds to the $j$-th eigenvector $u_j$ of $S$. In the case when $l_j = 0$, the eigenvector $v_j$ is arbitrary and $[K]_{j, j} = \sigma^2$. The scalar $\sigma^2$ is estimated as the average of the $d - k$ eigenvalues $\lambda_i$ of $S$ with $i > k$.
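The closed-form solution described above can be checked against the fixed-point equation $W = S C^{-1} W$ directly. The sketch below takes the rotation $R$ to be the identity, which is an arbitrary choice made for illustration; the function name `ppca_closed_form` and the simulated covariance matrix are likewise hypothetical.

```python
import numpy as np

def ppca_closed_form(S, k):
    """ML solution W = U_k (K - sigma2 I_k)^{1/2} (with R = I) and sigma2 from S."""
    eigval, eigvec = np.linalg.eigh(S)            # eigenvalues in ascending order
    lam, U = eigval[::-1], eigvec[:, ::-1]        # reorder to descending
    sigma2 = lam[k:].mean()                       # average of the d - k discarded eigenvalues
    W = U[:, :k] * np.sqrt(lam[:k] - sigma2)      # scale each retained eigenvector
    return W, sigma2

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 5))
S = A.T @ A / 200                                 # some sample covariance matrix
W, sigma2 = ppca_closed_form(S, k=2)

C = W @ W.T + sigma2 * np.eye(5)
residual = S @ np.linalg.solve(C, W) - W          # should vanish at a stationary point
print(np.abs(residual).max())
```

The residual is zero up to rounding because each retained column of `W` is an eigenvector of `S` with eigenvalue $l_j^2 + \sigma^2$, exactly as in the derivation above.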
The remaining question is whether the EM algorithm converges to the global maximum. If the stationary points of the likelihood with respect to $W$ that are spanned by minor eigenvectors (the eigenvectors with corresponding negligible eigenvalues) were stable, convergence would not be guaranteed. However, we can show that any stationary point built from eigenvectors which do not correspond to the highest eigenvalues of $S$ is a saddle point of the loglikelihood and does not provide a stable solution. For the detailed discussion please refer to (Tipping and Bishop 1999). The authors highlight the case when all $d - k$ discarded eigenvalues are equal to the smallest major principal eigenvalue(s). They show that such a situation provides a maximum spanned by principal eigenvectors and noise distribution (corresponding to the smallest principal eigenvalue(s), which become(s) zero).

5.3. The EM Algorithm for Incomplete Data

Until now, we developed the EM algorithm for Probabilistic PCA under the assumption that the data do not contain any missing values. In Section 4.2.2 we introduced the background of how we treat missing entries of the observation. We assume them to be Missing-At-Random, which allows us to ignore the indicator variable $r_t$ while estimating $W$ and $\sigma^2$. Given an observation vector with missing entries, $y_t = \left(y_t^o, y_t^m\right)$, the EM algorithm treats $y_t^m$ as an additional latent variable alongside $x_t$ in the model of Equation (37). The expectation of the joint loglikelihood from Equation (30) is computed with respect to the conditional distribution of $x_t, y_t^m \mid y_t^o, W, \sigma^2$ and provides the two following steps:
1. Expectation step: Expectation of the loglikelihood function of the joint distribution of $y_t, x_t \mid W, \sigma^2$ given by Equation (30) with respect to the conditional distribution $x_t, y_t^m \mid y_t^o, W, \sigma^2$
$$Q_m\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right) = E_{x_t, y_t^m \mid y_t^o, W, \sigma^2}\left[\log L_{y_t, x_t \mid W, \sigma^2}\left(\sigma^{*2}, W^*; y_{1:N}, x_{1:N}\right)\right]$$
2. Maximisation step: Finding $W^*$ and $\sigma^{*2}$ that maximize $Q_m\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right)$
$$\left(W^*, \sigma^{*2}\right) = \underset{W^* \in \mathbb{R}^{d \times k},\ \sigma^{*2} > 0}{\operatorname{argmax}}\ Q_m\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right)$$
We need to specify the moments of the conditional distribution of the latent variables given the observation vector, when we include the latent variable $y_t^m$. The conditional distribution $x_t, y_t^m \mid y_t^o, W, \sigma^2$ is obtained via Bayes’ rule as
$$\pi\left(x_t, y_t^m \mid y_t^o, W, \sigma^2\right) = \pi\left(x_t \mid y_t, W, \sigma^2\right) \pi\left(y_t^m \mid y_t^o, W, \sigma^2\right)$$
Given $N$ realisations of $y_t$ with arbitrary missing entries, the expectation step has the form
$$\begin{aligned} Q_m\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right) &= E_{x_t, y_t^m \mid y_t^o, W, \sigma^2}\left[\log L_{y_t, x_t \mid W, \sigma^2}\left(\sigma^{*2}, W^*; y_{1:N}, x_{1:N}\right)\right] \\ &= \int_{\mathbb{R}^k \times \mathbb{R}^d} \pi\left(x_t, y_t \mid y_t^o, W, \sigma^2\right) \log \prod_{n=1}^{N} \pi\left(y_n, x_n \mid W^*, \sigma^{*2}\right) dx_t\, dy_t \\ &= -\sum_{n=1}^{N} \Big\{ \frac{d}{2} \log \sigma^{*2} + \frac{1}{2} \operatorname{tr}\left(E\left[x_n^T x_n \mid y_n^o, W, \sigma^2\right]\right) + \frac{1}{2\sigma^{*2}} \operatorname{tr}\left(E\left[y_n^T y_n \mid y_n^o, W, \sigma^2\right]\right) \\ &\qquad - \frac{1}{\sigma^{*2}} \operatorname{tr}\left(W^* E\left[x_n^T y_n \mid y_n^o, W, \sigma^2\right]\right) + \frac{1}{2\sigma^{*2}} \operatorname{tr}\left(W^{*T} W^* E\left[x_n^T x_n \mid y_n^o, W, \sigma^2\right]\right) \Big\} \end{aligned}$$
where $E\left[x_n^T x_n \mid y_n^o, W, \sigma^2\right]$ is derived in Equation (44) and needs adjustment for missing data. The other moments of the conditional distribution $x_t, y_t \mid y_t^o, W, \sigma^2$ need to be calculated.

5.3.1. The Moments of Joint Distribution $x_t, y_t^m \mid y_t^o, W, \sigma^2$

The first component of the conditional probability in Equation (56) is given by Equation (42). For simplicity assume for a moment that $y_t = \left(y_t^o, y_t^m\right) \sim N\left(0_d, C_{d \times d}\right)$ for a covariance matrix
$$C_{d \times d} = \begin{pmatrix} C_{oo} & C_{om} \\ C_{mo} & C_{mm} \end{pmatrix}$$
where the indexes $o$ and $m$ correspond to the locations of the observed and missing entries of the random vector $y_t$. As shown in (Little and Rubin 2002), the conditional distribution $y_t \mid y_t^o$ under the MAR assumption is multivariate normal, that is
$$y_t \mid y_t^o \sim N\left( \begin{pmatrix} y_t^o & y_t^o C_{oo}^{-1} C_{om} \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 0 & C_{mm} - C_{mo} C_{oo}^{-1} C_{om} \end{pmatrix} \right),$$
since
$$\pi\left(y_t^m \mid y_t^o\right) = \frac{\pi\left(y_t^m, y_t^o\right)}{\pi\left(y_t^o\right)}$$
As derived in (Jamshidian 1997), the covariance matrix of the marginal distribution $y_t \mid W, \sigma^2$ is equal to
$$C = \begin{pmatrix} W_o W_o^T + \sigma^2 I_{d_o} & W_o W_m^T \\ W_m W_o^T & W_m W_m^T + \sigma^2 I_{d_m} \end{pmatrix}$$
where $d_o$ and $d_m$, with $d_o + d_m = d$, are the numbers of observed and missing elements (the latter can be zero), respectively, and the indexes $o$ and $m$ of the matrices denote the sets of rows of $W$ which correspond to the observed and missing values of $y_t$, respectively (recall that the columns of the matrix $W$ correspond to the values of $x_t$).
With the above in mind, the moments of the conditional distribution $x_t, y_t \mid y_t^o, W, \sigma^2$ are given by the following analogue of Theorem 1, which accounts for the incomplete data case.
Theorem 3.
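The expectation of the E-step, $E_{x_t, y_t^m \mid y_t^o, W, \sigma^2}\left[\log L_{y_t, x_t \mid W, \sigma^2}\left(\sigma^{*2}, W^*; y_{1:N}, x_{1:N}\right)\right]$, where $y_t = \left(y_t^o, y_t^m\right)$, is given by
$$\begin{aligned} Q_m\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right) &= \int_{\mathbb{R}^k \times \mathbb{R}^d} \pi\left(x_t, y_t \mid y_t^o, W, \sigma^2\right) \log \prod_{n=1}^{N} \pi\left(y_n, x_n \mid W^*, \sigma^{*2}\right) dx_t\, dy_t \\ &= -\sum_{n=1}^{N} \Big\{ \frac{d}{2} \log \sigma^{*2} + \frac{1}{2} \operatorname{tr}\left(E\left[x_n^T x_n \mid y_n^o, W, \sigma^2\right]\right) + \frac{1}{2\sigma^{*2}} \operatorname{tr}\left(E\left[y_n^T y_n \mid y_n^o, W, \sigma^2\right]\right) \\ &\qquad - \frac{1}{\sigma^{*2}} \operatorname{tr}\left(W^* E\left[x_n^T y_n \mid y_n^o, W, \sigma^2\right]\right) + \frac{1}{2\sigma^{*2}} \operatorname{tr}\left(W^{*T} W^* E\left[x_n^T x_n \mid y_n^o, W, \sigma^2\right]\right) \Big\} \end{aligned}$$
for the corresponding moments of the conditional distribution $x_n, y_n \mid y_n^o, W, \sigma^2$
$$\begin{aligned} E\left[y_n \mid y_n^o, W, \sigma^2\right]_{1 \times d} &= \begin{pmatrix} y_n^o & E\left[y_n^m \mid y_n^o, W, \sigma^2\right] \end{pmatrix} \\ E\left[y_n^T y_n \mid y_n^o, W, \sigma^2\right]_{d \times d} &= \begin{pmatrix} 0 & 0 \\ 0 & C_{mm} - W_m W_o^T C_{oo}^{-1} W_o W_m^T \end{pmatrix} + E\left[y_n \mid y_n^o, W, \sigma^2\right]^T E\left[y_n \mid y_n^o, W, \sigma^2\right] \\ E\left[x_n \mid y_n^o, W, \sigma^2\right]_{1 \times k} &= E\left[y_n \mid y_n^o, W, \sigma^2\right] W \left(W^T W + \sigma^2 I_k\right)^{-1} \\ E\left[x_n^T x_n \mid y_n^o, W, \sigma^2\right]_{k \times k} &= \sigma^2 \left(W^T W + \sigma^2 I_k\right)^{-1} + E\left[x_n \mid y_n^o, W, \sigma^2\right]^T E\left[x_n \mid y_n^o, W, \sigma^2\right] \\ E\left[x_n^T y_n \mid y_n^o, W, \sigma^2\right]_{k \times d} &= \begin{pmatrix} 0 \\ W_m - W_m W_o^T C_{oo}^{-1} W_o \end{pmatrix}^T + E\left[x_n \mid y_n^o, W, \sigma^2\right]^T E\left[y_n \mid y_n^o, W, \sigma^2\right] \end{aligned}$$
Proof. 
The corresponding steps of the calculation can be found in (Jamshidian 1997). ☐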

5.3.2. The Maximizers of Q m W , σ 2 | W * , σ * 2

The M-step of the EM algorithm uses the computed expectation defined in Theorem 3 and maximizes it with respect to the static parameters $W^*$ and $\sigma^{*2}$. The corresponding maximizers are given by the following theorem.
Theorem 4.
The maximizers of $Q_m\left(W, \sigma^2 \mid W^*, \sigma^{*2}\right)$ are the solution to the set of problems $\frac{\partial Q_m}{\partial W^*} = 0$ and $\frac{\partial Q_m}{\partial \sigma^{*2}} = 0$, and are given by
$$\begin{aligned} W^*_{d \times k} &= \left[\sum_{n=1}^{N} E\left[x_n^T y_n \mid y_n^o, W, \sigma^2\right]^T\right] \left[\sum_{n=1}^{N} E\left[x_n^T x_n \mid y_n^o, W, \sigma^2\right]\right]^{-1} \\ \sigma^{*2} &= \frac{1}{N d} \sum_{n=1}^{N} \operatorname{tr}\left\{ E\left[y_n^T y_n \mid y_n^o, W, \sigma^2\right] - 2\, W^* E\left[x_n^T y_n \mid y_n^o, W, \sigma^2\right] + E\left[x_n^T x_n \mid y_n^o, W, \sigma^2\right] W^{*T} W^* \right\} \end{aligned}$$
Proof. 
We need to replace the moments of the conditional distribution $x_t \mid y_t, W, \sigma^2$ in the proof of Theorem 2 with the corresponding moments of $x_t \mid y_t^o, W, \sigma^2$ derived in Theorem 3. In addition, we need to replace the terms in $y_t$ by their moments under the joint distribution $x_t, y_t \mid y_t^o, W, \sigma^2$, also given in Theorem 3. ☐

5.4. The Algorithm

The steps of computing the eigenvectors and corresponding loadings are summarized in the following algorithm. First, we standardize the data using information from the observed values stored in $Y_{N \times d}$. The estimator of location and variance, $\Theta_N = \left(\hat{\mu}, \hat{\sigma}^2\right)$, is a function of the non-missing values of the observation vector $y$ among its $N$ realisations. We then execute PPCA on the standardized data in two steps: the expectation step and the maximisation step.
Algorithm 1 Probabilistic Principal Component Analysis with missing values
1: for $j = 1, \ldots, d$ do
2:   Compute $\Theta_N\left(Y^o_{\cdot, j}\right) = \left(\hat{\mu}_j, \hat{\sigma}_j^2\right)$
3:   Standardize data $\tilde{Y}^o_{\cdot, j} = \frac{Y^o_{\cdot, j} - \hat{\mu}_j}{\hat{\sigma}_j}$
4: end for
5: $\tilde{Y}^m = 0$ and $\tilde{Y} = \left(\tilde{Y}^o, \tilde{Y}^m\right)$
6: Initialise: $\varepsilon$, $i = 0$, $W_{(0)} = W_0$, $\sigma^2_{(0)} = \sigma_0^2$
7: repeat
8:   E-step: Compute the corresponding moments from Equation (62) for $Q_m\left(W_{(i)}, \sigma^2_{(i)} \mid W_{(i)}, \sigma^2_{(i)}\right)$
9:   M-step: Compute the maxima of $Q_m\left(W_{(i)}, \sigma^2_{(i)} \mid W, \sigma^2\right)$ from Equation (64):
     $\left(W_{(i+1)}, \sigma^2_{(i+1)}\right) = \underset{W \in \mathbb{R}^{d \times k},\ \sigma^2 > 0}{\operatorname{argmax}}\ Q_m\left(W_{(i)}, \sigma^2_{(i)} \mid W, \sigma^2\right)$
10:    $i = i + 1$
11: until a convergence criterion is satisfied
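The E-step and M-step of the loop above can be sketched as follows. This is an illustrative sketch rather than a transcription of Theorems 3 and 4: it conditions directly on the observed entries of each row using standard Gaussian identities, $x_n \mid y_n^o \sim N\left(y_n^o W_o M_o^{-1}, \sigma^2 M_o^{-1}\right)$ with $M_o = W_o^T W_o + \sigma^2 I_k$, and so may differ in detail from the moment expressions stated in the theorems. The function name `ppca_em_missing`, the `NaN` coding of missing entries, the fixed number of iterations and the omission of the standardisation step (lines 1 to 5) are all assumptions made for the example.

```python
import numpy as np

def ppca_em_missing(Y, k, n_iter=150, seed=0):
    """EM for PPCA on centred data Y (N x d); NaN marks entries assumed to be MAR."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    obs = ~np.isnan(Y)
    W = rng.standard_normal((d, k))
    sigma2 = 1.0
    for _ in range(n_iter):
        Syx = np.zeros((d, k))     # sum_n E[y_n^T x_n | y_n^o]
        Sxx = np.zeros((k, k))     # sum_n E[x_n^T x_n | y_n^o]
        Syy = 0.0                  # sum_n tr E[y_n^T y_n | y_n^o]
        for n in range(N):         # E-step, one row at a time
            o = obs[n]
            y_o, W_o, W_m = Y[n, o], W[o], W[~o]
            M_o = W_o.T @ W_o + sigma2 * np.eye(k)
            Ex = np.linalg.solve(M_o, W_o.T @ y_o)              # E[x_n | y_n^o]
            Exx = sigma2 * np.linalg.inv(M_o) + np.outer(Ex, Ex)
            Sxx += Exx
            Syx[o] += np.outer(y_o, Ex)                         # observed rows
            Syx[~o] += W_m @ Exx                                # missing rows
            Syy += y_o @ y_o + np.trace(W_m @ Exx @ W_m.T) + sigma2 * (~o).sum()
        # M-step: closed-form maximisers of the expected complete loglikelihood
        W_new = Syx @ np.linalg.inv(Sxx)
        sigma2 = (Syy - 2.0 * np.trace(W_new.T @ Syx)
                  + np.trace(W_new.T @ W_new @ Sxx)) / (N * d)
        W = W_new
    return W, sigma2

# Usage on simulated data with roughly 10% of entries removed at random.
rng = np.random.default_rng(7)
W_true = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [1.0, -1.0], [0.5, 0.5]])
Y = rng.standard_normal((300, 2)) @ W_true.T + np.sqrt(0.1) * rng.standard_normal((300, 5))
Y[rng.random(Y.shape) < 0.1] = np.nan
W_hat, sigma2_hat = ppca_em_missing(Y, k=2)
print(sigma2_hat)
```

A convergence criterion on the change in the parameters or in $Q_m$ would normally replace the fixed iteration count used here.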

6. Statistically Robust Feature Extraction for Stochastic Principal Component Analysis

Until this point, we have assumed that any stochastic noise or observation errors in the demographic data are in some sense “well behaved”, for instance: additive, light tailed, symmetric and zero mean. In this section we relax this inherent assumption by developing a class of robust estimators that can withstand violations of such assumptions, which routinely arise in real data observations and, as we will demonstrate, especially in demographic data. Furthermore, we have assumed that the data is generally temporally stationary over the time period of study. If any of these assumptions does not hold, this has consequences from a statistical perspective for the real data analysis. In such cases we recommend resorting to feature extraction methods which are more robust (in a statistical sense) to violations of such features.
When non-robust feature extraction methods are naively utilised in the presence of violations of these implicit statistical assumptions, the result can be misleading feature extraction and distorted information content of these features, leading to increased bias or variance in the forecasts from stochastic mortality models incorporating such features.
Therefore, it is critical to ensure that the feature extraction is appropriately performed. To robustify the feature extraction techniques presented previously against violations of statistical features such as stationarity, light tails, homoskedasticity and Gaussianity, one can turn to robust statistical methods. This can strongly influence the findings based on such demographic feature extractions. Therefore, in this section we demonstrate a statistically rigorous approach to performing the feature extractions detailed previously within a robust estimation and feature extraction framework.
To achieve this, we first recall some basics of robust statistical inference, primarily targeting the robust estimation of location and scale or mean and covariance, as will be directly relevant in the stochastic PCA based methods proposed above.
Regardless of whether we work with standard or probabilistic PCA, the most straightforward method to improve its statistical robustness is to employ estimators of the covariance matrix which are less sensitive to outlying data points.

6.1. Robust Estimators of Mean and Covariance Matrix

To introduce the concept of statistically robust estimation for feature extraction in demographic data, we first introduce what exactly we mean by statistical robustness of feature extraction. This requires a short set of formal definitions.
Let us define an estimator $\Theta$ as a functional on the domain of distribution functions. We interchangeably use the definition of an estimator as a function of the $d$-dimensional sample $y_1, \ldots, y_N$, denoted by $\Theta_N$. In the following part we drop the time-related index of the random variable and denote it by $y$. The empirical distribution defined by the sample is denoted by $F_N$. The true population distribution and density functions of $y$ are denoted by $F$ and $f$, respectively.

6.1.1. Concept of Robustness

We consider robustness according to two measures: a measure of local robustness and a measure of global robustness. The two measures are defined via an $\epsilon$-contamination set of distribution functions, that is
Definition 1.
$\mathcal{F}_\epsilon$ is a contamination neighbourhood of the distribution $F$ defined as
$$\mathcal{F}_\epsilon = \left\{ G : G = (1 - \epsilon) F + \epsilon H, \text{ for } H \text{ any distribution} \right\}$$
given a fraction of contamination $0 \leq \epsilon \leq 1$.
One can then define the local robustness of an estimator Θ as measured by influence function given in Definition 2.
Definition 2 (Influence function).
The influence function of an estimator $\Theta$ on the domain of distribution functions is defined as
$$IF\left(x_0, \Theta, F\right) = \lim_{\epsilon \to 0} \frac{\Theta\left((1 - \epsilon) F + \epsilon \delta_{x_0}\right) - \Theta(F)}{\epsilon}$$
where $\delta_{x_0}$ is the probability measure which puts mass 1 at the point $x_0$, provided that $(1 - \epsilon) F + \epsilon \delta_{x_0}$ belongs to the domain of $\Theta$.
The influence function is crucial for calculating the asymptotic variance and efficiency of an estimator, as
$$\sqrt{N} \left(\Theta\left(F_N\right) - \Theta(F)\right) = \sqrt{N} \int_{\mathbb{R}^d} IF\left(y, \Theta, F\right) dF_N(y) + \ldots$$
where we used a Taylor expansion of the empirical distribution function $F_N$ around the true population distribution function $F$.
The influence function provides us with the knowledge how ϵ contamination on a point x 0 changes the information about the true distribution of the random variable y which is given by Θ . Thus, it is perceived as a local measure of robustness. To measure global robustness, one can examine a breakdown point ϵ * of estimator Θ N at the true population distribution F, given in Definition 3.
Definition 3 (Breakdown point).
The breakdown point $\epsilon^*$ of an estimator $\Theta_N$ at the true population distribution $F$ is defined as
$$\epsilon^* := \sup \left\{ \epsilon \leq 1 : \sup_{G \in \mathcal{F}_\epsilon} \left| \Theta_N(G) \right| < \infty \right\}$$
Intuitively, it is understood as the maximal contamination which does not cause the estimator to lose valid information about the true distribution $F$. We may define a finite-sample version of the breakdown point as
Definition 4.
The breakdown point $\epsilon_N^*$ of an estimator $\Theta$ at the empirical population distribution $F_N$ is defined as
$$\epsilon_N^*\left(\Theta, y_1, \ldots, y_N\right) := \max \left\{ n_1 \in \mathbb{Z} : \max_{i_1, \ldots, i_{n_1}} \sup_{z_1, \ldots, z_{n_1}} \left| \Theta\left(\tilde{F}_N\right) \right| < \infty \right\}$$
where $\tilde{F}_N$ is the empirical distribution of the sample in which the points $y_{i_1}, \ldots, y_{i_{n_1}}$ were replaced with arbitrary points $z_1, \ldots, z_{n_1}$.
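The finite-sample notion can be illustrated in one dimension by replacing a growing number of sample points with a very large value and observing when an estimator is carried away. The sketch below uses the sample mean and the sample median purely as familiar examples; the helper name `shift_under_contamination`, the contamination value and the sample size are assumptions chosen for the illustration, and replacing the first points only (rather than searching over all index sets) is a simplification.

```python
import numpy as np

def shift_under_contamination(estimator, sample, n_replaced, value=1e12):
    """Absolute change of the estimate after replacing n_replaced points by `value`."""
    corrupted = np.array(sample, dtype=float)
    corrupted[:n_replaced] = value
    return abs(estimator(corrupted) - estimator(sample))

rng = np.random.default_rng(0)
sample = rng.standard_normal(101)

mean_shift = shift_under_contamination(np.mean, sample, 1)        # one bad point suffices
median_ok = shift_under_contamination(np.median, sample, 50)      # still bounded
median_broken = shift_under_contamination(np.median, sample, 51)  # majority replaced

print(mean_shift, median_ok, median_broken)
```

With 101 points, the median stays within the range of the original sample as long as at most 50 points are replaced, whereas a single replaced point already moves the mean arbitrarily far.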
Having introduced these formal definitions of what exactly is meant by robust estimators, we overview the most frequently used estimators of a covariance matrix with respect to their robust characteristics according to introduced measures. Let us denote μ , C as a mean and covariance matrix of y .

6.1.2. M-Estimators

In the studies of (Maronna 1976) and (Huber and Ronchetti 2009) on the robust estimation of the covariance matrix, the authors define one of the first classes of robust estimators, called M-estimators, which are a generalized version of Maximum Likelihood Estimators (MLE) where
$$\Theta_N = \underset{\Theta \in \Omega}{\operatorname{argmax}} \prod_{n=1}^{N} f\left(y_n\right) \iff \Theta_N = \underset{\Theta \in \Omega}{\operatorname{argmin}} \sum_{n=1}^{N} -\log f\left(y_n\right).$$
The idea behind M-estimators is to replace the density function $f$ in Equation (70) with a function $\rho : \mathbb{R}_+ \to \mathbb{R}$ which down-weights outliers, that is
Definition 5 (M-estimators).
The M-estimator of a parameter $\Theta$ is defined as a solution to the problem
$$\Theta_N = \underset{\Theta \in \Omega}{\operatorname{argmin}} \sum_{n=1}^{N} \rho\left(d_n\right)$$
for a function $\rho : \mathbb{R}_+ \to \mathbb{R}$, where $d_n^2 = \left(y_n - \mu\right) C^{-1} \left(y_n - \mu\right)^T$ is the squared Mahalanobis distance of $y_n$.
Remark 3.
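If $\rho$ is a continuous function, denoting its derivative by $\rho' = \psi$, the estimator $\hat{\Theta}_N$ satisfies
$$\begin{aligned} \sum_{n=1}^{N} \frac{\psi\left(d_n\right)}{d_n} C^{-\frac{1}{2}} \left(y_n - \mu\right)^T &= 0 \\ \sum_{n=1}^{N} \frac{\psi\left(d_n\right)}{d_n} C^{-\frac{1}{2}} \left(y_n - \mu\right)^T \left(y_n - \mu\right) C^{-\frac{1}{2}} &= 0 \end{aligned}$$
which are the robust analogue of the normal equations one would typically solve in ML estimation, for instance in regression.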
Remark 4.
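If we additionally assume that $y$ is a random vector from the elliptical family with density of the form
$$f\left(y; \mu, C\right) = c^{-1} |C|^{-\frac{1}{2}} g\left(d^2(y, \Theta)\right)$$
with $g : \mathbb{R}_+ \to \mathbb{R}_+$ being a density generator of the random variable $y$, the solution to the problem in Equation (71) is equivalent to
$$\begin{aligned} \sum_{n=1}^{N} \frac{\psi\left(d_n\right)}{d_n} C^{-\frac{1}{2}} \left(y_n - \mu\right)^T &= 0 \\ \sum_{n=1}^{N} \frac{\psi\left(d_n\right)}{d_n} C^{-\frac{1}{2}} \left(y_n - \mu\right)^T \left(y_n - \mu\right) C^{-\frac{1}{2}} &= I_d \end{aligned}$$
A more general notation is used in (Maronna 1976) and (Huber and Ronchetti 2009), who introduce functions $u_1, u_2 : \mathbb{R}_+ \to \mathbb{R}$ to rewrite Equation (74) as follows in Definition 6.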
Definition 6.
The M-estimator $\Theta_N$ of the parameter $\Theta$ of location and scatter of a random variable $y$ from the elliptical family, given its $N$ realisations, is defined as a solution to
$$\begin{aligned} \sum_{n=1}^{N} u_1\left(d_n\right) \left(y_n - \mu\right) &= 0 \\ \sum_{n=1}^{N} u_2\left(d_n^2\right) C^{-\frac{1}{2}} \left(y_n - \mu\right)^T \left(y_n - \mu\right) C^{-\frac{1}{2}} &= I_d \end{aligned}$$
for functions $u_1, u_2 : \mathbb{R}_+ \to \mathbb{R}$.
The authors provide conditions on $u_1$ and $u_2$ which ensure existence and uniqueness of the solution to Equation (75) and its normal asymptotic distribution with convergence rate $\frac{1}{2}$. (Maronna 1976) proves the breakdown point of Equation (75) to be very sensitive to the dimensionality of the data, as $\epsilon^* = \frac{1}{d + 1}$.
Recall that in this study we want to compute the variance of every variable rather than a covariance matrix (Algorithm 1), which makes an M-estimator a very suitable tool. What is more, the robustness of M-estimators does not depend on the sample size. This is another advantage of an M-estimator when working with population data, for which only a very limited number of observations is available.
Remark 5.
As an example of the functions $u_1$ and $u_2$, (Huber 1964) gives the following
$$u_1(s) = \frac{\psi_H(s, k)}{s}, \qquad u_2(s) = \frac{\psi_H\left(s, k^2\right)}{s}, \qquad \rho_H(s, k) = \begin{cases} \frac{1}{2} s^2 & |s| < k \\ k |s| - \frac{1}{2} k^2 & |s| \geq k \end{cases}$$
for $\psi_H(s, k) = \rho_H'(s, k)$ and the tuning constant $k$.
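For concreteness, the Huber functions can be written down directly. The sketch below uses the standard textbook form of Huber's $\rho$ (quadratic near zero, linear beyond $k$), its derivative $\psi$, and the weight $\psi(s)/s$ that plays the role of $u_1$. The function names and the default tuning constant $k = 1.345$ (a commonly quoted choice, not one taken from this paper) are assumptions made for illustration.

```python
import numpy as np

def huber_rho(s, k=1.345):
    """Huber loss: 0.5 s^2 for |s| < k, and k|s| - 0.5 k^2 otherwise."""
    s = np.asarray(s, dtype=float)
    a = np.abs(s)
    return np.where(a < k, 0.5 * s**2, k * a - 0.5 * k**2)

def huber_psi(s, k=1.345):
    """Derivative of the Huber loss: s clipped to the interval [-k, k]."""
    return np.clip(np.asarray(s, dtype=float), -k, k)

def huber_weight(s, k=1.345):
    """Weight psi(s) / s: equal to 1 inside [-k, k] and k / |s| outside."""
    return k / np.maximum(np.abs(np.asarray(s, dtype=float)), k)

s = np.array([-4.0, -1.0, 0.5, 2.5])
print(huber_rho(s))
print(huber_psi(s))
print(huber_weight(s))      # observations far from the centre receive weight < 1
```

The weight function makes the down-weighting explicit: points within distance $k$ are treated as in least squares, while more distant points contribute with a weight that decays like $1/|s|$.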
The estimator defined with $\rho_H$ is proven to have the minimal maximal asymptotic variance among all affine invariant estimators under the assumption that the sample $y_1, \ldots, y_N$ is normally distributed with zero mean and identity covariance matrix.
For the problem of scatter estimation only, (Tyler 1987) introduced the function $u_2(s) = \frac{d}{s}$, which is investigated in detail by (Frahm and Jaekel 2010).
Definition 7.
Tyler’s estimator of a covariance matrix for the unbounded function $u_2(s) = \frac{d}{s}$ is obtained by solving
$$\frac{d}{N} \sum_{n=1}^{N} \frac{\left(y_n - \mu\right)^T \left(y_n - \mu\right)}{\left(y_n - \mu\right) C^{-1} \left(y_n - \mu\right)^T} = C.$$
Lemma 1.
Under the assumption of a generalized elliptical distribution of $y$, the solution to Equation (77) exists and is unique up to a scalar factor.
Recall that $u_2$ is not bounded and hence does not satisfy the conditions described in (Maronna 1976) or (Huber and Ronchetti 2009). Therefore, uniqueness can be obtained only up to a scaling parameter.
Lemma 2.
If $y$ belongs to the generalized elliptical family, Tyler’s estimator from Definition 7 is strongly consistent with the true covariance matrix, if it exists, up to a scaling factor and has a normal asymptotic distribution with convergence rate $\frac{1}{2}$.
Tyler’s M-estimator, whose asymptotic variance was analysed by, e.g., (Tyler 1987), is proven to have the lowest maximum bias among all normally distributed estimators for random variables from the symmetric elliptical family.
For a point contamination, the maximal breakdown point of the estimator is equal to $\epsilon^* = \frac{1}{d}$, as proved in (Tyler 1987). A later study shows that for any other contamination, the maximal breakdown point satisfies
$$\frac{1}{d + 1} \leq \epsilon^* \leq \frac{1}{d}$$
Furthermore, Tyler’s M-estimator is the ML estimator of the angular Gaussian distribution. To show this property we use the following Lemma.
Lemma 3.
If y has generalized elliptical distribution then y = d μ + R C 1 2 u for u being uniformly distributed random variable on the unit d dimensional sphere and R being a scalar variable.
Recall that R is a component which is responsible for generating tails of y distribution.
Remark 6.
If R is absolutely continuous and independent from u , the generator of probability density function of centred y , y ˜ , is given by f y ˜ : = Γ r 2 2 π r 2 y ˜ r 1 2 f R ( y ˜ ) for r = r a n k ( C ) and f R being probability density function of R. Then y is symmetrically distributed.
Remark 7.
If the variable $R$ is allowed to be negative and to depend on the uniformly distributed $u$, the distribution of $y$ is called a generalized elliptical distribution. The generalized elliptical family makes it possible to model asymmetric and tail-dependent distributions of $y$.
Theorem 5 (Distribution-free Tyler estimator).
The estimator introduced by Definition 7 is distribution-free, i.e., it does not depend on values of R.
Proof. 
Following Lemma 3, we notice that
$$z := \frac{y-\mu}{\|y-\mu\|_F} = \frac{R\,u\,C^{\frac{1}{2}}}{\|R\,u\,C^{\frac{1}{2}}\|_F} \overset{d}{=} \frac{\operatorname{sign}(R)\,u\,C^{\frac{1}{2}}}{\|u\,C^{\frac{1}{2}}\|_F}, \tag{79}$$
so $z$ depends on $R$ only through its sign and is therefore robust against extreme realisations of $R$. Here $\|\cdot\|_F$ denotes the Frobenius norm. Rewriting Equation (77) using $N$ realisations of $z$, $z_1, \dots, z_N$, we obtain
$$C = \frac{d}{N}\sum_{n=1}^{N}\frac{z_n^T z_n}{z_n C^{-1} z_n^T} = \frac{d}{N}\sum_{n=1}^{N}\frac{C^{\frac{1}{2}} u_n^T u_n C^{\frac{1}{2}}}{u_n u_n^T}. \tag{80}$$
 ☐
Tyler’s estimator is therefore invariant under any change in the distribution of $R$, i.e., it is distribution-free.
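The invariance used in the proof of Theorem 5 can be checked numerically. The following is a minimal sketch, assuming a known location $\mu$ and a positive radius $R$; the helper name `normalise` and all numerical values are hypothetical and chosen only for illustration. Two observations that share the same direction but have radii differing by eight orders of magnitude are mapped to the same normalised vector $z$.

```python
import math

def normalise(y, mu):
    """Return z = (y - mu) / ||y - mu|| for a row vector y stored as a list."""
    diff = [a - b for a, b in zip(y, mu)]
    norm = math.sqrt(sum(v * v for v in diff))
    return [v / norm for v in diff]

# Illustrative values only (not taken from the paper).
mu = [1.0, -2.0, 0.5]
direction = [0.3, -1.2, 0.8]   # plays the role of u C^{1/2}

# Two draws with the same direction but very different radii R > 0.
y_small = [m + 0.01 * v for m, v in zip(mu, direction)]
y_huge = [m + 1e6 * v for m, v in zip(mu, direction)]

z_small = normalise(y_small, mu)
z_huge = normalise(y_huge, mu)

# Largest componentwise difference between the two normalised vectors.
max_gap = max(abs(a - b) for a, b in zip(z_small, z_huge))
print(max_gap)
```

Because only $z$ enters Equation (80), an extreme radius cannot dominate the sum, which is the sense in which the estimator is distribution-free.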
Theorem 6.
The estimator introduced by Definition 7 is the MLE of the angular Gaussian distribution.
Proof. 
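The probability density function of $v = \frac{u\,C^{\frac{1}{2}}}{\|u\,C^{\frac{1}{2}}\|_F}$ is given by
$$f_v(v) = \frac{\Gamma\left(\frac{d}{2}\right)}{2\pi^{\frac{d}{2}}}\, |C^{-1}|^{\frac{1}{2}}\, \|v\,C^{-\frac{1}{2}}\|_F^{-d}. \tag{81}$$
Having $N$ realisations of $v$, $v_1, \dots, v_N$, the log-likelihood function is given by
$$\begin{aligned} \log L(v_1, \dots, v_N; C) &= \sum_{n=1}^{N} \log f_v(v_n) = \text{const} + \frac{N}{2}\log|C^{-1}| - \frac{d}{2}\sum_{n=1}^{N}\log\left(v_n C^{-1} v_n^T\right) \\ &= \text{const} + \frac{N}{2}\log|C^{-1}| - \frac{d}{2}\sum_{n=1}^{N}\log\left(z_n C^{-1} z_n^T\right). \end{aligned} \tag{82}$$
Since $\mathcal{P}_d$ is an open set, the maximizer of $\log L$ with respect to $C$ is a stationary point. Setting the derivative to zero, with $\tilde{y}_n = y_n - \mu$, gives
$$\begin{aligned} \frac{\partial \log L}{\partial C^{-1}} = 0 &\iff \frac{N}{2}\left(2C - \operatorname{diag}(C)\right) - \frac{d}{2}\sum_{n=1}^{N}\frac{2\,\tilde{y}_n^T\tilde{y}_n - \operatorname{diag}(\tilde{y}_n^T\tilde{y}_n)}{\tilde{y}_n C^{-1}\tilde{y}_n^T} = 0 \\ &\iff N C - d\sum_{n=1}^{N}\frac{(y_n-\mu)^T(y_n-\mu)}{(y_n-\mu)\,C^{-1}(y_n-\mu)^T} = 0, \end{aligned} \tag{83}$$
which is precisely Equation (77). ☐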
The estimation problem for Tyler’s estimator has the form $x = f(x)$, which allows us to use a fixed-point iteration scheme whose $(i+1)$-th step is
$$C^{(i+1)} = \frac{d}{N}\sum_{n=1}^{N}\frac{(y_n-\mu)^T(y_n-\mu)}{(y_n-\mu)\left(C^{(i)}\right)^{-1}(y_n-\mu)^T}. \tag{84}$$
Lemma 4.
The fixed-point algorithm in Equation (84) converges to $aC$ for a scalar $a > 0$.
Proof. 
Since $f(C) = \frac{d}{N}\sum_{n=1}^{N}\frac{(y_n-\mu)^T(y_n-\mu)}{(y_n-\mu)\,C^{-1}(y_n-\mu)^T}$ is continuous on $\mathcal{P}_d$, the algorithm converges to $aC$ for a scalar $a > 0$. ☐
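Following (Tyler 1987), let us define the function $M : \mathcal{P}_d \to \mathcal{P}_d$ such that
$$M(\Gamma) = \frac{d}{N}\sum_{n=1}^{N}\frac{\Gamma^{\frac{1}{2}}(y_n-\mu)^T(y_n-\mu)\,\Gamma^{\frac{1}{2}}}{(y_n-\mu)\,\Gamma\,(y_n-\mu)^T}. \tag{85}$$
The fixed-point iteration step in Equation (84) can be rewritten as
$$C^{(i+1)} = \left(C^{(i)}\right)^{\frac{1}{2}} M\left(\left(C^{(i)}\right)^{-1}\right)\left(C^{(i)}\right)^{\frac{1}{2}}. \tag{86}$$
Denoting $\Gamma^{(i)} = \left(C^{(i)}\right)^{-1}$ and $M^{(i)} = M\left(\Gamma^{(i)}\right)$, we get
$$\Gamma^{(i+1)} = \left(\Gamma^{(i)}\right)^{\frac{1}{2}}\left(M^{(i)}\right)^{-1}\left(\Gamma^{(i)}\right)^{\frac{1}{2}}. \tag{87}$$
In order to deal with the lack of uniqueness of the solution to Equation (84), (Tyler 1987) restricts the search space to the positive definite symmetric matrices with trace equal to 1 by setting
$$\Gamma^{(i+1)} = \frac{\left(\Gamma^{(i)}\right)^{\frac{1}{2}}\left(M^{(i)}\right)^{-1}\left(\Gamma^{(i)}\right)^{\frac{1}{2}}}{\operatorname{tr}\left(\left(\Gamma^{(i)}\right)^{\frac{1}{2}}\left(M^{(i)}\right)^{-1}\left(\Gamma^{(i)}\right)^{\frac{1}{2}}\right)} = \frac{\left(\Gamma^{(i)}\right)^{\frac{1}{2}}\left(M^{(i)}\right)^{-1}\left(\Gamma^{(i)}\right)^{\frac{1}{2}}}{\operatorname{tr}\left(\Gamma^{(i)}\left(M^{(i)}\right)^{-1}\right)}. \tag{88}$$
In this way, it is ensured that $\operatorname{tr}\left(\Gamma^{(i+1)}\right) = 1$. In his study, Tyler proves that the convergence of the sequence $\Gamma^{(i+1)}$ to a non-singular matrix $\Gamma$ implies $M(\Gamma) = I_d$.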
Theorem 7 (Tyler 1987).
If the following conditions hold:
1. 
the sample $y_1, \dots, y_N$ does not contain values equal to $\mu$;
2. 
the empirical distribution measure $F_N$ of the sample satisfies $F_N(S) < \frac{1}{d}$, for $S$ being any proper subspace of $\mathbb{R}^d$;
3. 
for some $m$, the $m$th smallest eigenvalue of $\Gamma^{(i)}$ satisfies $\lambda_{m,d} > \frac{d}{\operatorname{rank}(S)}\, F_N(S)$, where $S$ is any proper subspace of $\mathbb{R}^d$;
then $\Gamma^{(i)} \to \Gamma$ and $M(\Gamma) = I_d$.
Remark 8.
Generally, to deal with the lack of uniqueness of the solution to Equation (84), a common practice is to impose additional constraints such as $|C| = 1$, as in (Frahm and Jaekel 2010), or $\operatorname{tr}(C) = 1$, as in (Tyler 1987) or (Sun et al. 2016).
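The fixed-point scheme of Equation (84), combined with the trace constraint mentioned in Remark 8, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the location $\mu$ is taken as known, the iteration is carried out on $C$ (normalised to $\operatorname{tr}(C) = 1$) rather than on $\Gamma = C^{-1}$, and the function names `inverse` and `tyler_scatter`, the tolerance and the simulated data are all hypothetical.

```python
import random

def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def tyler_scatter(ys, mu, n_iter=200, tol=1e-10):
    """Fixed-point iteration of Equation (84), rescaled so that tr(C) = 1."""
    d = len(mu)
    centred = [[a - b for a, b in zip(y, mu)] for y in ys]
    n_obs = len(centred)
    c = [[1.0 / d if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(n_iter):
        c_inv = inverse(c)
        new = [[0.0] * d for _ in range(d)]
        for y in centred:
            # Quadratic form (y - mu) C^{-1} (y - mu)^T for a row vector.
            q = sum(y[i] * c_inv[i][j] * y[j] for i in range(d) for j in range(d))
            for i in range(d):
                for j in range(d):
                    new[i][j] += d / n_obs * y[i] * y[j] / q
        trace = sum(new[i][i] for i in range(d))
        new = [[v / trace for v in row] for row in new]
        change = max(abs(new[i][j] - c[i][j]) for i in range(d) for j in range(d))
        c = new
        if change < tol:
            break
    return c

# Simulated data (illustrative only): Gaussian directions, then heavy-tailed radii.
random.seed(1)
mu = [0.0, 0.0, 0.0]
base = [[random.gauss(0, 1), random.gauss(0, 2), random.gauss(0, 0.5)]
        for _ in range(200)]
radii = [random.paretovariate(1.5) for _ in base]
heavy = [[r * v for v in y] for r, y in zip(radii, base)]

c_base = tyler_scatter(base, mu)
c_heavy = tyler_scatter(heavy, mu)
```

Rescaling every observation by its own heavy-tailed radius leaves the estimate unchanged up to rounding, in line with Theorem 5. Normalising by the determinant instead of the trace, as in (Frahm and Jaekel 2010), would only change the rescaling step.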

6.1.3. S-Estimators

As the robustness of the class of M-estimators is highly influenced by the dimensionality of the data, we introduce an extension to the problem in Equation (71) and define the class of S-estimators. The class was first introduced by (Rousseeuw and Yohai 1984) and extended to the multivariate setting by (Davies 1987).
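Definition 8 (S-estimator).
The S-estimator of a parameter $\Theta$ is defined as a solution to
$$\Theta_N = \operatorname*{argmin}_{\Theta \in \Omega} |C| \quad \text{subject to} \quad \int \rho\left(d(y, \Theta)\right) dF_N(y) \ge b_0, \tag{89}$$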
where the constant $b_0$ is the mean of $\rho(d_y)$, with $d_y$ being the Mahalanobis distance of the random vector $y$ under the assumed distribution of $y$, that is
$$b_0 = E\left[\rho(d_y)\right] = E\left[\rho\left(\|y-\mu\|_F\right)\right] = \frac{\pi^{\frac{d}{2}}}{\Gamma\left(\frac{d}{2}\right)}\int_0^{\infty}\rho(r)\, f(r)\, r^{d-1}\, dr. \tag{90}$$
The function $\rho$ is defined as in Definition 5.
Remark 9.
If $\rho$ is continuous, then the constraint in the problem of Equation (89) holds with equality.
In (Davies 1987), the author investigates the properties of S-estimators under the assumption of elliptically distributed $y$. (Davies 1987) gives general assumptions on the function $\rho : \mathbb{R}_+ \to [0, 1]$, e.g., being continuous on its domain and equal to zero beyond some $c$ such that $0 < c < \infty$. The author proves that under these assumptions, and if $n b_0 \ge d + 1$, Equation (89) has at least one solution with a non-singular estimate $\Theta = (\mu, C)$.
The work of (Davies 1987) also proves consistency of the estimator $\Theta_N$ and its uniqueness under these assumptions. When, additionally, $\rho$ has a continuous third derivative, the solution to Equation (89) is asymptotically normal with convergence rate $\frac{1}{2}$.
Let us denote $\epsilon = 1 - b_0$. Following (Davies 1987), if $N(1 - 2\epsilon) \ge d + 1$ and the sample is in general position, that is, no more than $d$ points of the sample lie on a $(d-1)$-dimensional hyperplane, then the finite-sample breakdown point is equal to
$$\epsilon_N^* = \frac{[N\epsilon] + 1}{N} \quad \Longrightarrow \quad \lim_{n \to \infty} \epsilon_n^* = \epsilon = \epsilon^*. \tag{91}$$
The paper (Lopuhaa 1989) introduces an alternative definition of S-estimators, considering a function $\rho : \mathbb{R}_+ \to [0, \infty)$, and investigates the relation between S-estimators and M-estimators. The author imposes stronger assumptions on the function $\rho$ than are required to make his definition of S-estimators equivalent to Davies’s definition (Davies 1987) by transforming $\rho \mapsto 1 - \frac{\rho}{\sup \rho}$. Under the assumption of an elliptical distribution of $y$, the author shows that any solution to the problem in Equation (89) satisfies Equation (71). He rewrites the problem in Equation (89) in a form similar to Equation (74), using the Lagrangian
$$\log L_N(\Theta, \lambda) = \log|C| - \lambda\left(\frac{1}{N}\sum_{n=1}^{N}\rho(d_n) - b_0\right) \tag{92}$$
to obtain the equivalent set of equations
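$$(\Theta_N, \lambda) = \operatorname*{argmin}_{\Theta \in \Omega,\, \lambda \in \mathbb{R}} \log L_N(\Theta, \lambda) \iff \begin{cases} \dfrac{1}{N}\displaystyle\sum_{n=1}^{N} u(d_n)\,(y_n - \mu) = 0 \\[2ex] \dfrac{d}{N}\displaystyle\sum_{n=1}^{N} \dfrac{u(d_n)}{v(d_n)}\,(y_n - \mu)^T (y_n - \mu) = C, \end{cases} \tag{93}$$
for $u(s) = \frac{\psi(s)}{s}$ and $v(s) = \psi(s)\,s - \rho(s) + b_0$. The term $-\rho(s) + b_0$ substitutes for the Lagrangian multiplier $\lambda$. Hence, every solution to Equation (89) is a solution to Equation (93), which is the M-estimator problem defined by Equation (75).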
The author of (Lopuhaa 1989) argues that S-estimators achieve the same asymptotic variance as the corresponding M-estimators but that, as the dimensionality of the data increases, they have a higher breakdown point than M-estimators.
Remark 10.
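An example of functions that satisfy the conditions for the uniqueness and existence of a solution to Equation (89) is given by Tukey’s biweight functions, that is
$$\rho_B(s) = \begin{cases} \dfrac{s^2}{2} - \dfrac{s^4}{2k^2} + \dfrac{s^6}{6k^4}, & |s| \le k \\[1.5ex] \dfrac{k^2}{6}, & |s| \ge k \end{cases} \qquad \psi_B(s) = s\left(1 - \left(\frac{s}{k}\right)^2\right)^2 \mathbf{1}_{[-k, k]}(s). \tag{94}$$
The breakdown point of the S-estimator for $\rho = 1 - \frac{\rho_B}{\sup \rho_B}$ is equal to
$$\epsilon^* = \frac{\left[N(1 - b_0)\right] + 1}{N} = \frac{\left[\dfrac{N}{\sup \rho_B}\, E\,\rho_B\left(\|y - \mu\|_F\right)\right] + 1}{N}. \tag{95}$$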
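The biweight functions of Remark 10 are straightforward to evaluate. The sketch below is only an illustration: the function names and the tuning constant `k = 3.0` are assumptions made here and are not values taken from the study. It also includes the rescaled function $1 - \rho_B / \sup \rho_B$, which takes values in $[0, 1]$.

```python
def rho_biweight(s, k):
    """Tukey's biweight rho_B: a polynomial on [-k, k], constant k^2/6 outside."""
    if abs(s) <= k:
        return s**2 / 2 - s**4 / (2 * k**2) + s**6 / (6 * k**4)
    return k**2 / 6

def psi_biweight(s, k):
    """Derivative of rho_B: s (1 - (s/k)^2)^2 on [-k, k], zero outside."""
    if abs(s) > k:
        return 0.0
    return s * (1 - (s / k) ** 2) ** 2

def rho_rescaled(s, k):
    """The transformation 1 - rho_B / sup rho_B, with sup rho_B = k^2 / 6."""
    return 1 - rho_biweight(s, k) / (k**2 / 6)

k = 3.0  # hypothetical tuning constant, for illustration only
h = 1e-6
# Central finite difference of rho_B at s = 1, to compare with psi_B(1).
numeric_derivative = (rho_biweight(1 + h, k) - rho_biweight(1 - h, k)) / (2 * h)
print(numeric_derivative, psi_biweight(1.0, k))
```

The finite-difference check confirms that $\psi_B$ is the derivative of $\rho_B$, and that both pieces of $\rho_B$ meet at $|s| = k$ with the common value $k^2/6$.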

7. Data

The examined data consist of male and female mortality and demographic data obtained from the Human Mortality Database (http://www.mortality.org) for European countries. Table 1 summarizes the availability of the data for all the countries included in the study.
We use four different sets of mortality data: raw data (Birth counts and Death counts) and life tables (Life Expectancy at Birth and Death Rates). We conduct separate analyses for the female and male populations.
The time series vary in the number of available observations. The longest time series are provided by the Swedish and French mortality data, starting from 1751 and 1816, respectively. The shortest time series are those for Greece and Slovenia, covering 1983–2014 and 1981–2013, respectively.
With regard to Birth counts and Life Expectancy at Birth, the information per country at each time point is one-dimensional, i.e., the annual count of live births by sex in year $t$ (Birth counts) or the expected life span of a person born in year $t$ (Life Expectancy at Birth). Hence, a single observation in these cases consists of a number of entries equal to the number of countries included in the study, that is, the 31 listed in Table 1, per gender.
Age-specific information is provided for Death counts and Death Rates. A single observation per country at time $t$ describes the number of deaths of people with ages from 0 to 110+ (Death counts) or the number of deaths for ages from 0 to 110+ scaled to the size of that population, per unit of time (Death Rates). The availability of the time series differs among age groups. Usually, the shortest time series are those collected for age groups above 100 years.
Since the Lee-Carter model with the cohort effect has already been reported to be prone to over-fitting when fitted to short time series (the currently available mortality data is classified as a short time series), and since adding additional latent processes increases the number of parameters to estimate, we decide to work with data aggregated in the “5 × 1” format, i.e., by 5-year age group per calendar year. The ages are grouped into the following strata: 0, 1–4, 5–9, 10–14, 15–19, 20–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65–69, 70–74, 75–79, 80–84, 85–89, 90–94, 95–99, 100–104, 105–110, 110+.
A single observation of Deaths or Death Rates consists of a number of entries equal to the number of countries included in the study (i.e., 31) times the number of age groups, that is, 24. It accounts for the information at time $t$ available for all 31 countries across all 24 age groups.

7.1. Preprocessing of Data

The Human Mortality Database (HMD) team applies several preprocessing procedures that aim to “clean” Death counts and population sizes before using them to calculate and distribute death rates and life tables. The individual steps are discussed in the technical report (Wilmoth et al. 2007). The adjustments are applied in order to distribute people of unknown age across age groups and to split data into age categories, i.e., from the age stratification “5 × 1” to “1 × 1” and from “1 × 1” to Lexis triangles. The common practice is to use a regression model for splitting death counts in the “1 × 1” format into Lexis triangles and to apply cubic splines to split “5 × 1” into “1 × 1”. In addition to the adjustments applied to Death counts, the age-specific population size is estimated using four methods: linear interpolation, intercensal survival, extinct cohorts and survivor ratios.
The life tables are calculated using Lexis triangles and population sizes. Before death rates are converted into probabilities of death, the rates at older ages (80 and above) are smoothed using logistic regression. The “abridged” life tables are calculated from the Lexis triangle tables rather than from the raw data. This ensures that both sets of tables contain identical values of life expectancy and other quantities.
Recall that the smoothing applied to the mortality data can influence the feature extraction and diminish the difference between the robust and non-robust versions of the feature extraction methodology which we study in this paper. The topic is further discussed in Section 8.
In addition to the briefly discussed procedures which have already been applied to the data in HMD, we needed to adjust the data to provide reliable information about missing values. We notice an ambiguity in the labelling of unavailable data, which is denoted either by “NA” or by “0”. Death Rates, Birth counts and Life Expectancy at Birth are unlikely to produce values equal to “0”. Hence, we replace all “0” values which appear in these data sets by “NA”. The zeros which appear in Death counts in older age groups are more difficult to handle, as it is not certain whether nobody in the particular age group died or whether the record has missing values. For this reason, we decided to limit our analysis to age groups up to 90 and again replace any “0” by “NA”.
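The relabelling rule described above can be expressed compactly. The following is a minimal sketch, assuming the series has been read into a Python list in which unavailable entries appear either as the string "NA" or as a zero; the function name `mark_zeros_as_missing` and the example values are hypothetical.

```python
NA = None  # single in-memory marker standing in for "NA"

def mark_zeros_as_missing(values):
    """Map both "NA" strings and zeros to the same missing-value marker."""
    cleaned = []
    for v in values:
        if v == "NA" or v == 0:
            cleaned.append(NA)
        else:
            cleaned.append(float(v))
    return cleaned

# Made-up death rates for one country and age group over four years.
death_rates = [0.0123, 0, "NA", 0.0456]
cleaned = mark_zeros_as_missing(death_rates)
print(cleaned)
```

Using one marker for both labels means that later steps, such as counting missing entries per observation, do not need to distinguish between the two original encodings.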

7.2. Missing Data

The following subsection summarizes the different types of missingness across countries, age groups and sexes which occur in the demographic data set analysed in our study. The findings of the subsection are the following: among the Birth related data, where a total observation is of vector type (one-dimensional information per country), the incompleteness of the data is due to a general unavailability of the information per country. However, for Death related data, where a total observation is of matrix type (an age-specific vector of information per country), we notice a pattern of single missing values at individual time points, which fits the definition of MAR. In the subsequent section, with the empirical analysis of the data set under the derived framework of Probabilistic Principal Component Analysis, we assume the MAR type of missingness. Extending this assumption requires calculating the integral from Equation (34) and incorporating it into the PPCA framework.

7.2.1. Observed Patterns of Missingness

Missing data appear when no value is available for a component of the observation vector. The Probabilistic Principal Component Analysis with missing data discussed in Section 5.3 handles missing data by filling them with a projection based on principal components that are calculated from the available information. Thus, the more missing data is present for a given variable, the less impact this variable has on the specification of the projection relative to the assumptions. The results for variables with missing data are more influenced by our assumptions on the distributions from Section 5, as will be discussed in detail below.
In order to handle missing values, we need to understand and study their pattern. In the following part, we demonstrate the three patterns of missingness which are present in the analysed data sets:
Type 1:
no information about a variable in a few observations for a given country;
Type 2:
no information about a variable in all observations for a given country;
Type 3:
general unavailability of information about all variables for a country except for a limited set of observations.
The analysis is conducted separately for the four data sets and for both sexes. We show the results for four cases, in which we segment the data according to the maximal percentage of missing entries allowed per total observation: 0%, 25%, 50% and 75%. For instance, a single observation of Birth counts is 31-dimensional. When we analyse the data set with respect to the 50% case, we exclude all observations in which the number of missing entries is greater than $0.5 \times 31 \approx 15$.
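The segmentation into the four cases amounts to a row filter on the share of missing entries. The following is a minimal sketch, assuming each observation is a list in which missing entries are stored as `None`; the function names and the toy rows are hypothetical.

```python
def missing_share(row):
    """Fraction of entries in one observation that are missing."""
    return sum(v is None for v in row) / len(row)

def subsample(rows, max_share):
    """Keep observations whose share of missing entries is at most max_share."""
    return [row for row in rows if missing_share(row) <= max_share]

d = 31  # one entry per country, as for Birth counts
row_15 = [None] * 15 + [1.0] * (d - 15)   # 15 of 31 entries missing
row_16 = [None] * 16 + [1.0] * (d - 16)   # 16 of 31 entries missing
complete = [1.0] * d

rows = [row_15, row_16, complete]
kept_50 = subsample(rows, 0.50)   # the 50% case
kept_0 = subsample(rows, 0.0)     # the 0% case (no missing entries allowed)
print(len(kept_50), len(kept_0))
```

With 31 entries, the 50% case keeps an observation with 15 missing entries and drops one with 16, which matches the threshold $0.5 \times 31 \approx 15$ used in the text.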
In the case of no missing data, that is, the 0% case, for Death counts and Death Rates in Females, the number of rows without missing entries is too small for any meaningful analysis. For this reason, we drop the minimal number of columns with the highest number of missing entries in order to collect a sufficiently large sample for our analysis.
The results of the analysis are similar for Birth counts and Life Expectancy at Birth, and likewise for Death counts and Death Rates. Hence, we discuss the patterns of missingness in the Number of Births and the Number of Deaths.

7.2.2. Births

The left plot of Figure 1 shows the percentage of missing entries per observation vector of total births over all countries considered versus calendar years for Birth counts, disaggregated for Females and Males. The sample starts in 1751 (Swedish data) and spans to 2014. Until 1946, the sample has a percentage of missing entries above 50% (the middle vertical red line on the plot). The sample of the 25% case starts in 1950. We observe the same pattern of missing values for the female and male populations (the corresponding lines for the two populations overlap and only the black line, corresponding to the male population, is visible).
Figure 2 indicates the availability of data per country (y axis) versus the calendar year (x axis). The black colour denotes points in time when the data is not available for a given country. The Swedish data is not marked by any black entry except in 2014, as it is the longest time series. Recall that the missing data pattern which is characteristic of this data set is a limited availability of data for a given country. From the empirical analysis of the data we learn that the missing entries do not appear randomly. However, this does not violate the assumed behaviour of missingness, and we can still proceed with the methodology of feature extraction described in the previous parts of the paper.
The red vertical lines correspond to the starting points of the subsamples when the maximum number of missing entries per row (an observation in time) is equal to (from the left side on the corresponding plots) 75%, 50% and 25%. The subsamples for the 25% and 0% cases provide principal components which are determined by almost equal amounts of information from every country included in the analysis for the examined period of time. However, the calculation of the components for the subsample in the 75% case is dominated by the time series of countries which are not available before 1950. Therefore, the calculated features are more prone to be impacted by the distribution assumptions and convey less information about the dynamics present in the examined period of time than the components obtained from longer time series.

7.2.3. Deaths

A single observation of Death counts is a 504-dimensional vector which reflects the numbers of deaths per country and age group in time. The proportion of non-missing entries per calendar year, again disaggregated by gender, for Death counts is shown on the right plot in Figure 1. The percentage of missing entries decreases more slowly than for Birth counts (the slope of the curve is flatter), which indicates a longer distance between the 50% and 25% cases. We notice discrepancies in the patterns of missing values between the female and male populations. Interestingly, the female population data has no observations without missing values.
It is also informative to demonstrate the pattern of missing values per age group, as displayed in Figure 3 and Figure 4 below. Red vertical lines correspond to the starting points in calendar time for the total proportions of missing data corresponding to the 50% and 25% cases. We observe a new pattern of missing values: particular variables have a few missing observations within the subsample. The pattern occurs mainly after 1950 in age groups between 1 and 25 (darker shade of blue for single observations) or in the age group 95–100. Another interesting analysis can be done when we display the patterns of missingness by age and per country, as shown in Figure 5 and Figure 6, again disaggregated by gender. The pattern is present only for a subset of countries. For instance, data related to the Irish population has a higher percentage of missing values in only four age groups. Still, the dominant pattern of missingness is the availability of the information for a country being limited to some period of time.

8. Feature Extraction from European Demographic Data via Probabilistic Principal Component Analysis

The following section provides results for the feature extraction using the methodology introduced in Section 5, applied to different types of European demographic data sets. Attention is drawn to the effect of the simple and straightforward robustification outlined in the previous sections, whose application we demonstrate on the data sets described above. The main observation from this study is the difference in the consistency of the features over time and over the proportion of missingness between the two frameworks, robust and non-robust. The features calculated by means of robust estimators are more consistent over time and over different proportions of missingness than their non-robust alternatives. This is especially visible for the features extracted from data that has not been previously preprocessed, e.g., Birth counts. The effect of robustification is smaller if a data set has been smoothed, as is the case for Life Expectancy at Birth.

8.1. The Assessment of the Methodology

We conduct a comparison between robust and non-robust Probabilistic Principal Component Analysis (PPCA) in a stochastic setting. We undertake this exercise in order to incorporate the most meaningful eigenvectors as exogenous factors into the model in Equation (8). Each of the data sets discussed in the section is treated separately, that is, we compute the eigenvectors for Births, Life Expectancy at Birth, Deaths and Death Rates. Recall that an observation at time $t$, $y_t$, conveys the information about a given data set in calendar year $t$ from the 31 countries listed in Table 1. Hence, an observation of Births or Life Expectancy is 31-dimensional. Since Deaths and Death Rates carry information which is age-group specific, the dimension of a single observation in these data sets is equal to the number of countries times the number of age groups.
The following section summarizes the results of PCA according to:
  • Population: Females and Males;
  • Subsamples referring to the maximum allowed proportion of missing values per observation: 0% (no missing), 25%, 50% and 75%, as discussed in Section 7.2;
  • Type of the standardisation procedure: robust and non-robust, which are used for the estimators of location and covariance in the PPCA framework.
We use M-estimators of the covariance and mean as a robust alternative to the sample estimators. As discussed in Section 6.1.2, the class of M-estimators performs very well in small dimensions, as its robustness is a function of the dimensionality and not of the sample size. Since the data we use is not a long time series and we standardize every variable marginally (Algorithm 1), this simple estimator should provide us with reliable outcomes. In particular, we consider the Huber-type M-estimators discussed in Remark 5, which are characterized by a normal asymptotic distribution with convergence rate $\frac{1}{2}$ and by several characteristics ensuring both uniqueness and optimality of the estimation. Additionally, it is the estimator of covariance which has the minimal asymptotic variance among all estimators for Gaussian data. Thus, the choice is consistent with our assumption of normally distributed data for the treatment of missingness in the PPCA framework, which we outlined in Section 5.
Among the objects which we analyse are eigenvectors, eigenvalues, scores and Mahalanobis distances, which use the estimated covariance matrix $\hat{C} = W W^T + \sigma^2 I_d$, calculated from the iteratively evaluated $\sigma^2$ and the projection matrix $W$. The Mahalanobis distance is measured around the vector $0$ (the data has been centred), and is therefore given by
$$d_n^2 := d^2\left(y_n, \hat{C}\right) = y_n \hat{C}^{-1} y_n^T. \tag{96}$$
The results show how distant a single observation is from the assumed long-term mean.
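The distance above can be evaluated without forming $\hat{C}^{-1}$ explicitly, by solving a linear system instead. The following is an illustrative sketch only: it assumes a centred observation stored as a list, a $d \times k$ projection matrix $W$ stored as a list of rows, and a scalar $\sigma^2$; the function names and the toy numbers are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def mahalanobis_sq(y, w, sigma2):
    """Squared distance y C^{-1} y^T with C = W W^T + sigma2 * I_d."""
    d = len(y)
    k = len(w[0])
    c = [[sum(w[i][m] * w[j][m] for m in range(k)) + (sigma2 if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    x = solve(c, y)               # x = C^{-1} y^T
    return sum(a * b for a, b in zip(y, x))

# Toy example: d = 2, k = 1, so C = [[2, 0], [0, 1]].
w = [[1.0], [0.0]]
y = [2.0, 1.0]
d2 = mahalanobis_sq(y, w, sigma2=1.0)
print(d2)
```

In the toy example $\hat{C}$ is diagonal, so the distance is $2^2/2 + 1^2/1 = 3$, which gives a quick sanity check of the routine.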
The EM algorithm described in Section 5 provides the eigenvectors corresponding to the $k$ largest eigenvalues. Due to the specifics of the data (small number of observations), it would be difficult to incorporate many factors into the model from Equation (8) and achieve reliable estimation results. This is primarily a result of the curse of dimensionality in the parameter space, which would lead in this case to diffusivity in the resulting likelihood utilised in the estimation. The latent states and static parameters would become difficult to filter and estimate. Thus, we limit our analysis to the $k = 3$ main eigenvectors, which explain most of the variability, as shown by the standard PCA with non-missing data conducted separately for each country.

8.2. Births

The results of PPCA for Birth counts among Females and Males are similar. This outcome follows general intuition, as there is no external factor which influences the births of women and men in European countries differently. Also, recall that the Birth counts are the least pre-processed data set in our analysis.
Figure 7 shows the Mahalanobis distances (x axis) of the Number of Births over time (y axis) for the female (a) and male (b) populations. Each sub-panel consists of two plots, which present results for data standardized by robust (lower plots) and non-robust (upper plots) estimators of the mean and covariance matrix. Different colours of lines depict distances for subsamples where the maximum allowed proportion of missingness per observation is 0%, 25%, 50% and 75%. Recall that the subsamples start at different times, and therefore the corresponding results are the outcome of estimation on different data with different impacts of the distribution assumptions.
As expected, the effect of robust standardisation is substantial, since the framework produces features by down-weighting outliers in the data. The distances which correspond to the robust standardisation are more aligned historically, that is, they are more ’robust’ when the missingness increases. This indicates that robust standardisation of the data captures the characteristics of the population distribution more efficiently. Recall the earliest non-robust Mahalanobis distance of the 50% case, which is very distant from the statistic in the same calendar year for the 75% missingness case. For the robust case, the corresponding distances are more aligned, which demonstrates the effect of robustness.
Since the subsample of the 75% missingness case is substantially longer, we expect the PPCA results to be different, as the sample captures more regimes present in the demographic data. Also, the sample has a higher number of missing values, which are estimated using the projection based on the assumption of a normal distribution. This also impacts the outcomes of PPCA.
The eigenvalues of the estimated covariance matrices are shown in Figure 8. Different colours of lines highlight eigenvalues which correspond to different cases of missingness. The upper panels show the results for the non-robust framework, whereas the bottom plots show those for the robust one. The magnitude of the eigenvalues, as well as the spreads between them over different levels of accepted missingness, are higher for the robust case.
The 75% case of missingness results in smaller discrepancies between eigenvalues for the robust and non-robust frameworks. The corresponding eigenvectors exhibit similar behaviour, as shown in Figure 9. The discrepancies between the robust and non-robust eigenvectors are more significant for the cases with smaller proportions of missing values per observation. This outcome can be justified by the fact that the 75% case is more affected by the a priori assumption of normal distributions. The discrepancies between the two methods of standardisation become smaller as the projection of missing values starts to dominate the estimation of the principal components. The robust and non-robust estimators similarly capture the information about the normal distribution.
The blue dotted vertical lines on the plots of eigenvectors separate the outcomes into developed and developing countries, as listed in Table 1. Regardless of the case of missingness (except the 75% case) and the type of standardisation, we notice similar features for countries within each of the groups.
We would expect the robust scores to be aligned, as has been observed for the corresponding Mahalanobis distances. However, the described PPCA methodology does not re-estimate the mean. The data is centred once during the initialisation. In the presence of missing values, their projection changes the mean, while the data is centred only at the start of the procedure. Thus, the mean is static only when there is no missing data. This simplification of the framework results in different levels of the scores in Figure 10.

8.3. Life Expectancy at Birth

As mentioned in Section 7.1, Birth counts are the only data set in our analysis which has not been modified or preprocessed before being made available in HMD. The several stages of adjustments which are applied to the Death counts and population sizes result in a smaller number of outliers in Life Expectancy at Birth. The outcomes of PPCA for data standardized using robust and non-robust estimators of the mean and covariance matrix do not vary as significantly as for Birth counts over different proportions of missingness.
The robust Mahalanobis distances in Figure 11 are slightly more distant than their non-robust equivalents. However, the distances for different cases of missing values are similarly aligned in both cases, in contrast to Birth counts. We may observe the same pattern among the scores of the three principal components in Figures 15 and 16.
Figure 12, Figure 13 and Figure 14 show eigenvalues and eigenvectors for Females and Males. Only the results for the 75% case of missingness exhibit more variation among the standardisation procedures. Since the subsample corresponding to the 75% case is significantly longer, the discrepancies can again be rationalized by two reasons: the effect of the a priori assumption on distributions in Section 5 and the higher number of regimes captured in the data. Moreover, analysing the outcomes from the aggregation among the European countries, we notice that the second eigenvector has opposite signs for the countries of the two country groups. Recall that its values are more volatile for the male population of developing countries (except for the case of 75% of missing values).
The obtained scores for Life Expectancy at Birth are shown in Figure 15 and Figure 16. The levels of the scores are very close to zero. This is an expected outcome, as we differenced the data since it exhibits a polynomial trend. The distribution of differences is expected to have zero mean. Hence, the scores are aligned for all the cases of missingness and distributed around zero, which is negligibly affected by the simplification of the discussed framework. However, the two methodologies of standardisation result in different magnitudes of scores. Since the corresponding eigenvectors are very similar for the two standardisations, the magnitude is a reliable indicator of whether an observation is outlying.

8.4. Deaths

The Mahalanobis distances for the Death counts exhibit similar behaviour for Females regardless of the standardisation procedure. The discrepancies between the statistics are more substantial for the male population, as highlighted in Figure 17.
The eigenvalues of the non-robust estimator of the covariance matrix do not vary between the cases of missingness up to 50%. The plots are shown in Figure 18. For the 75% case, the first eigenvalue starts to dominate more significantly than for other missingness cases, especially for Females. On the other hand, the robust eigenvalues are more volatile. In particular, the results for Females provide unexpected outcomes for the 50% case in comparison to the 25% case, even though the subsamples for these cases differ only by a few calendar years in the mid-1940s. With regard to Males, we observe that the dominance of the first eigenvalue increases with the number of missing values.
The corresponding eigenvectors are shown in Figure 19 and Figure 20 for Females and Males, respectively. The colours of the heatmaps correspond to the magnitude of the components of the eigenvector, which are country (x axis) and age group (y axis) specific. The non-robust estimation results in eigenvectors with smaller magnitude which are smoother within the age groups and countries. Recall that the distribution of colours for the robust case has bigger spreads between values (so-called “bumps”), which is highlighted by more intense colours of blue and red. The exception is the 75% case.
The vertical black dotted lines on the heatmaps divide the countries listed on the x axis into developed (left side) and developing (right side). Again, this ordering of results stresses the differences between the eigenvectors within these two groups of countries. The first eigenvector for developing countries has a break point around the age group of 40 for all cases of missingness for the male population and for all cases except 50% for the female population. The eigenvector for developed countries has a break point at the age group of 80 for the female population and 75 for the male population, with an additional break at 35 for the 0% and 25% cases.
The 75% case is analysed separately. The first eigenvectors do not differ between the two types of standardisation but exhibit a structure which is country-group specific. The developed countries are characterized by uniform values around zero for all age groups. The vectors of developing countries are more volatile, with breaks around 20–30 for Males, and even more volatile for Females. The second eigenvector differs between both types of standardisation and between the two groups of countries. It is almost constant around zero for Males in developing countries and more volatile for Males in developed countries, with breaks in age groups between 50 and 60. The second eigenvector for Females resembles the third eigenvector of Males.
The scores are presented in Figure 21 and Figure 22. The second and third scores of Males are smoother, in contrast to the results for Females. The exception is the 75% case, whose results vary both among sexes and among standardization procedures.

8.5. Death Rates

The analysis of Death Rates leads to conclusions similar to those for Death counts. The corresponding figures are Figure 23, Figure 24, Figure 25 and Figure 26. The only discrepancies are exhibited by the eigenvalues for Males in the 75% case, as they are more aligned and higher in magnitude relative to the other cases than the corresponding eigenvalues in the Death counts analysis.
As shown in Figure 27 and Figure 28, the robustification does not greatly influence the estimation of the eigenvectors. As the Death Rates are preprocessed and smoothed before being distributed by HMD, we again conclude that the preprocessing decreased the number of outlying data points. In this particular case, the robust standardisation is about as informative about the true distribution as the non-robust one.
Also, recall the similarity of results among different cases of missingness, especially for the first eigenvector. The second and third ones are smoother for high levels of missingness. The colour map is affected by the scaling parameter (−1, 1), which may cause red to become blue, but apart from this we see similar outcomes.
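The colour flip mentioned above reflects the fact that an eigenvector is only determined up to a factor of −1. The following minimal sketch illustrates this on a toy 2 × 2 matrix and shows one possible convention for aligning signs before comparing heatmaps. The helper `align_sign` and the toy matrix are illustrative assumptions and are not part of the estimation procedure used in the paper.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def align_sign(vec, ref):
    """Return vec or -vec, whichever has a non-negative inner product with ref.

    Hypothetical helper: it fixes the arbitrary sign of an eigenvector
    relative to a chosen reference vector.
    """
    return [-a for a in vec] if dot(vec, ref) < 0 else list(vec)


# Toy covariance-like matrix; (1, 1) is an eigenvector with eigenvalue 3.
A = [[2.0, 1.0], [1.0, 2.0]]
v_plus = [1.0, 1.0]
v_minus = [-1.0, -1.0]

# Both signs satisfy A v = 3 v, so a solver may return either one.
both_valid = (
    matvec(A, v_plus) == [3.0 * x for x in v_plus]
    and matvec(A, v_minus) == [3.0 * x for x in v_minus]
)

# Align the second estimate with the first before plotting or comparing.
v_aligned = align_sign(v_minus, v_plus)
```

With such a convention, the same direction estimated under two missingness cases is drawn in the same colour.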
The three most meaningful eigenvectors differ between the two groups of analysed countries: developed and developing. The corresponding results are divided by the vertical black dotted lines. The first eigenvectors of developed countries are very smooth across age groups. For developing countries we observe a U-shaped structure with a peak in the 45–60 age groups for Males, and similarly for Females in the cases with a small proportion of missingness. When we allow more missing values, the eigenvectors for Females in developed countries are closer to zero and flat across age groups, whereas those for developing countries are more distant from zero and more volatile.

9. Stochastic Mortality Models for UK utilizing Factor Extraction from European Demographic Data

In this section we demonstrate the results of incorporating features extracted from European demographic data in the stochastic mortality model for British female mortality data over a study period of 1922 to 2014 with 10 years ahead forecasting validation.
The key findings are that the utilisation of demographic features improves the in-sample and out-of-sample predictive posterior mean Bayesian point estimation and forecasts for the log death rates. Additionally, the employed robustification methodology reduces the variances of the error terms in both observation and state equations and produces a better out-of-sample fit than its non-robust alternative. It indicates that the features extracted in the robust manner are more consistent over time and capture better the information about the true distribution of the demographic data.
The model with the smallest mean square error of estimation and prediction adds age-specific components to the latent process. It is referred to below as DFM-PC-B. However, the other examined models, which also improve the predictability of the log death rates, are useful for interpretation, as they reveal the individual impact of each European country's data on British female log death rates.

9.1. Description of the Models

Before presenting the real data example we note that all the Bayesian models developed and Markov chain samplers constructed were first tested on synthetic case studies in which the true parameters and state variables are known. The performance was found to be very good, which provides confidence in the accuracy and behaviour of the methods and models developed. The synthetic study results are provided in a technical appendix and are not discussed in this paper. The use of synthetic data enables us to validate the estimation by the Forward-Backward Kalman Filter with Gibbs Sampler. The models that we considered in our simulations and empirical studies are labelled as follows:
LCC: 
Lee-Carter model with the stochastic cohort effect given by Equations (2) and (3);
DFM-PC: 
Demographic factor model which incorporates $\varrho_t$ into LCC, given by Equation (8). Please refer to Appendix B for an illustration of how the models from this class are created;
DFM-PC-B: 
The mean of the first principal component of Birth counts as a static parameter, with age-specific elements of $\varrho_t$;
DFM-PC-D-r: 
The first principal component of Death counts (which is age and country specific) as an exogenous factor; each element of $\varrho_t$ corresponds to a country-specific subvector of the component; robust standardisation;
DFM-PC-D-s: 
The first principal component of Death counts (which is age and country specific) as an exogenous factor; each element of $\varrho_t$ corresponds to a country-specific subvector of the component; non-robust standardisation;
DFM-PC-Mx-r: 
The first principal component of Death Rates (which is age and country specific) as an exogenous factor; each element of $\varrho_t$ corresponds to a country-specific subvector of the component; robust standardisation;
DFM-PC-Mx-s: 
The first principal component of Death Rates (which is age and country specific) as an exogenous factor; each element of $\varrho_t$ corresponds to a country-specific subvector of the component; non-robust standardisation.
The models of the class DFM-PC address Case 1 from Section 3, where factors are incorporated into the observation Equation (2). The factors are obtained by performing PPCA jointly on the set of data for all countries listed in Table 1 excluding the following specific countries: United Kingdom (as it is our response variable), Greece and Slovakia (due to short time series).
DFM-PC-B incorporates the mean of the first principal component of the Birth counts, which is a country-specific vector. The matrix $\tilde{F}$ is a $21 \times 21$ diagonal matrix with the mean on the diagonal. Hence $\varrho_t$, which corresponds to the model DFM-PC-B, is a 21-dimensional, age-specific state process and attempts to capture an age-specific dynamic in addition to the cohort-period effects.
The models DFM-PC-D-r, DFM-PC-D-s, DFM-PC-Mx-r and DFM-PC-Mx-s incorporate the first component of Death counts and Death Rates, respectively. Recall that the components for these data sets can be presented as age-specific and country-specific matrices, as shown in Section 8. Due to the high dimensionality of the problem, we let each element of $\varrho_t$ correspond to the subvector of the first component that is specific to a single country. Such a subvector has 21 dimensions, which correspond to the age groups. Hence, $\varrho_t$ is a 28-dimensional country-specific state process. The country-specific subvectors of the first component are placed in the columns of the $21 \times 28$ matrix $\tilde{F}$. The last letter in the names of the models DFM-PC-D and DFM-PC-Mx denotes the type of standardisation applied to the data before performing PPCA: robust (by M-estimator) or non-robust (sample estimator).
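The arrangement of the country-specific subvectors into the columns of $\tilde{F}$ can be sketched as follows. The sketch assumes that the first principal component is stored as one long vector in which the age-group loadings of each country are contiguous. This ordering, the function names `build_F_tilde` and `observation_term`, and the toy dimensions are illustrative assumptions rather than the authors' implementation.

```python
def build_F_tilde(component, n_ages, n_countries):
    """Arrange a stacked loading vector into an n_ages x n_countries matrix.

    Assumes (hypothetically) country-major stacking: the entries with index
    c * n_ages, ..., (c + 1) * n_ages - 1 belong to country c.
    Column c of the result is the age-specific subvector of country c.
    """
    if len(component) != n_ages * n_countries:
        raise ValueError("component length must equal n_ages * n_countries")
    return [
        [component[c * n_ages + x] for c in range(n_countries)]
        for x in range(n_ages)
    ]


def observation_term(F, rho_t):
    """Compute F @ rho_t, the factor contribution to each age group."""
    return [sum(f_xc * r_c for f_xc, r_c in zip(row, rho_t)) for row in F]


# Toy dimensions: 3 age groups and 2 countries
# (the paper uses 21 age groups and 28 countries).
toy_component = [1, 2, 3, 10, 20, 30]
F_toy = build_F_tilde(toy_component, n_ages=3, n_countries=2)
contribution = observation_term(F_toy, rho_t=[1.0, 0.5])
```

Each element of the state vector then scales one column, so a state element close to zero switches off the contribution of the corresponding country for all age groups at once.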
In the following we analyse the population mortality of the United Kingdom based on the models listed above and the Bayesian methodology studied in this paper. We then examine the forecasting properties of the models for death rates.

9.2. Setup

For the Bayesian estimation of models, we assume the priors given in the Appendix A.3 and Appendix A.4 to be
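$$
\begin{aligned}
&\kappa_0 \sim N(0, 10^2), \quad \gamma_0 \sim N(0, 10^2), \quad \alpha_x \sim N(0, 10^2), \quad \beta_x \sim N(0, 10^2), \quad \sigma_\varepsilon^2 \sim IG(2.01, 0.01), \\
&\theta \sim N(0, 10^2), \quad \eta \sim N(0, 10^2), \quad \lambda \sim N_{[-1,1]}(0, 10^2), \quad \sigma_\kappa^2 \sim IG(2.01, 0.01), \quad \sigma_\gamma^2 \sim IG(2.01, 0.01), \\
&\varrho_0^{i} \sim N(0, 10^2), \quad [\Omega]_{i,j} \sim N(0, 10^2), \quad \Psi_j \sim N(0, 10^2), \quad \sigma_\varrho^2 \sim IG(2.01, 0.01).
\end{aligned}
$$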
The number of iterations of the Markov chain is 50,000 for the LCC model and 200,000 for the other models, with 90% burn-in. The chain is initialised at $\alpha = \bar{y}_{1:T}$, $\beta_x = \frac{1}{21}$, $\sigma_\epsilon^2 = 0.0005$, $\theta = 0.005$, $\eta = 0.02$, $\sigma_\kappa^2 = 0.01$, $\sigma_\gamma^2 = 0.0005$, $\sigma_\varrho^2 = 1.0$ and $[\Omega]_{i,i} = \frac{1}{m}$, where $m$ is either the number of countries or the number of age groups (depending on the model). The convergence of the sampler has been tested in synthetic data studies. These studies revealed that the estimation of the drift parameters of the factor state process $\varrho_t$ converges very slowly for short time series such as those found in mortality data. We therefore set these parameters to zero and do not sample them in this study.
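To give a feel for how vague the priors above are, the following minimal sketch draws from the scalar priors using only the Python standard library. It assumes that $N(0, 10^2)$ denotes a normal distribution with standard deviation 10, that $IG(2.01, 0.01)$ is parameterised by shape and scale, and that the prior on $\lambda$ is a normal truncated to $[-1, 1]$. The function names are hypothetical and the rejection step is chosen for simplicity rather than efficiency.

```python
import random


def sample_inverse_gamma(rng, shape, scale):
    """Draw from IG(shape, scale) as the reciprocal of a Gamma(shape, rate=scale) draw."""
    # random.gammavariate takes (shape, scale), so a rate of `scale`
    # corresponds to a Gamma scale parameter of 1 / scale.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


def sample_truncated_normal(rng, mean, sd, low, high):
    """Draw from N(mean, sd^2) restricted to [low, high] by simple rejection."""
    while True:
        draw = rng.gauss(mean, sd)
        if low <= draw <= high:
            return draw


def sample_priors(rng):
    """One joint draw of the scalar priors (assumed parameterisation, see above)."""
    return {
        "kappa_0": rng.gauss(0.0, 10.0),
        "theta": rng.gauss(0.0, 10.0),
        "eta": rng.gauss(0.0, 10.0),
        "lambda": sample_truncated_normal(rng, 0.0, 10.0, -1.0, 1.0),
        "sigma2_eps": sample_inverse_gamma(rng, 2.01, 0.01),
        "sigma2_kappa": sample_inverse_gamma(rng, 2.01, 0.01),
        "sigma2_gamma": sample_inverse_gamma(rng, 2.01, 0.01),
    }


rng = random.Random(0)
prior_draws = [sample_priors(rng) for _ in range(1000)]
```

Under this parameterisation the inverse gamma prior has mean 0.01 / 1.01 but a very heavy right tail, since its shape parameter is only just above 2.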

9.3. Estimation of Static Parameters

Estimated values of the static parameters (except $\alpha$, $\beta$ and $\Omega$) for the British female mortality data (1922–2003) are shown in Table 2. The remaining estimated static parameters are displayed in Figure 29, Figure 30 and Figure 31. The results are shown for the different models, which are listed in the first column of the table or indicated by the colour of the lines on the plots.
The static parameters of the factor process under the DFM-PC-B model are age specific. In addition to the cohort and period effects, they provide supplementary information related to the corresponding age groups. Figure 29 shows the estimated diagonal elements of the transition matrix $\Omega$ under this model. Parameters with values close to unity indicate that the corresponding factor state processes decay slowly. The elements of the state process whose parameters are closer to zero decay faster. Parameters which are negative and close to zero indicate that the corresponding latent state processes fluctuate around zero.
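The interpretation of the diagonal elements of $\Omega$ can be illustrated with a minimal sketch. It assumes a scalar first-order autoregressive state with zero drift, $\varrho_t = \omega \varrho_{t-1} + \text{noise}$, so that the conditional mean given $\varrho_0$ is $\omega^t \varrho_0$. The function name and the three values of $\omega$ are illustrative choices and not estimates from Figure 29.

```python
def expected_path(omega, rho_0, n_steps):
    """Conditional mean of a zero-drift AR(1) state: omega**t * rho_0 for t = 0..n_steps."""
    return [rho_0 * omega ** t for t in range(n_steps + 1)]


# Coefficient close to unity: the effect of the initial state persists.
slow = expected_path(0.95, 1.0, 10)

# Coefficient close to zero: the effect dies out within a few steps.
fast = expected_path(0.20, 1.0, 10)

# Negative coefficient close to zero: the path alternates in sign
# while shrinking, i.e., it fluctuates around zero.
oscillating = expected_path(-0.20, 1.0, 10)
```

After ten steps the first path still retains more than half of its initial value, whereas the second is numerically indistinguishable from zero.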
For the DFM-PC-B model, which incorporates age-specific latent processes (supplementary to the cohort effect), the elements of $\Omega$ which are positive and close to zero describe the decaying dynamics of the corresponding age-specific processes. These latent processes have a more significant impact on death rate modelling at the start of the sample; this impact decreases over time and the cohort effect becomes sufficient to model log death rates in these age groups. See, for instance, the age groups between 70 and 80 in Figure 29. Such processes can be interpreted as period effects which are specific to particular age groups. The elements of $\Omega$ which are close to unity describe latent processes which have a consistent impact, or a consistent lack of impact, over time. If they have an effect on log death rates (i.e., their domain is not close to zero), they convey age-specific information which is supplementary to the cohort and period effects and consistently demanded by the model over time.
The matrix $\Omega$ estimated under the models DFM-PC-D-r, DFM-PC-D-s, DFM-PC-Mx-r and DFM-PC-Mx-s refers to country-specific features. Here the parameters are related to the influence of specific countries on British female log death rates. The estimates and their confidence intervals are displayed in Figure 31. For instance, the estimates of $\Omega$ under all models agree on the lack of effect of Austrian or Bulgarian demographic data on the log death rates over the whole sample span. On the other hand, the estimate of the parameter corresponding to the Belarusian eigenvector is close to unity under all models and therefore highlights an informative effect of this feature on British log mortality rates which is consistent over time.
It is worth pointing out that the estimated variances of the error terms in the observation and state equations are higher for DFM-PC-D-s and DFM-PC-Mx-s, where the data have been non-robustly standardised. These models are found to have a greater mean square error of in-sample and out-of-sample fit than their robust alternatives, as shown in Table 4. Hence, the robustification procedure employed in this study improved the overall goodness of fit of the considered models. The features extracted from European demographic data by means of robust estimators of the mean and covariance provide information which is consistent over time and which better reflects the true distribution of the demographic data sets.
We chose not to calculate the MLE of the parameters in our models, as it has been documented that even for the standard period-cohort type Lee-Carter stochastic mortality models, classical MLE frameworks can face convergence and estimation challenges because gradient-based and method-of-scoring recursive optimisation methods get stuck in local optima of the marginal likelihood surface. We refer the interested reader to Fung et al. (2017), where such issues are discussed in more depth. Rather than resolving, in the frequentist setting, the known problems that may arise with classical MLE for such models, which may be further compounded in the extended models we developed, we remain within the Bayesian modelling paradigm and report the analogue of the MLE that can be obtained from Bayesian inference in the case of uninformative priors. That is, since our priors are relatively uninformative, we report the maximum a posteriori (MAP) estimator, i.e., the posterior mode, of the parameters as the Bayesian analogue of the MLE, defined as
$$\psi_{MAP} = \underset{\psi}{\operatorname{argmax}} \; \pi\left(\varrho_{0:T}, \psi \mid y_{1:T}\right)$$
for $\pi\left(\varrho_{0:T}, \psi \mid y_{1:T}\right)$ being the joint posterior density of the states $\varrho_{0:T}$ and the vector of static parameters $\psi$ given the observations $y_{1:T}$, as introduced in Appendix A.
The MAP estimates and the MLE should be similar in the case of uninformative priors, with the advantage that the MAP estimate is obtained from the MCMC sampler output, which is less prone to the types of estimation challenges experienced by gradient descent methods working directly with the likelihood surface. Please refer to Table 3 for the MAP point estimates analogous to the results in Table 2.
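One simple way to read a MAP estimate off sampler output is to evaluate the unnormalised log posterior at every retained draw and keep the draw with the highest value. The following minimal sketch does this for a toy model with a single mean parameter and a vague prior. The toy model, the data values and the function names are illustrative assumptions, and direct draws from the known conjugate posterior stand in for the output of an MCMC sampler.

```python
import math
import random


def map_from_samples(samples, log_posterior):
    """Return the sampled value with the highest (unnormalised) log posterior."""
    return max(samples, key=log_posterior)


# Toy model (illustrative only): y_i ~ N(mu, 1) with a vague N(0, 10^2) prior on mu.
data = [1.8, 2.1, 2.4, 1.9, 2.3]


def log_posterior(mu):
    """Unnormalised log posterior of mu for the toy model."""
    log_lik = -0.5 * sum((y - mu) ** 2 for y in data)
    log_prior = -0.5 * (mu / 10.0) ** 2
    return log_lik + log_prior


# Stand-in for MCMC output: draws from the conjugate posterior of the toy model.
rng = random.Random(1)
post_var = 1.0 / (len(data) + 1.0 / 100.0)
post_mean = post_var * sum(data)
samples = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(5000)]

mu_map = map_from_samples(samples, log_posterior)
mu_mle = sum(data) / len(data)
```

With a vague prior the two point estimates nearly coincide in this toy example, which is the property relied upon in the text.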

9.4. Filtering of Latent Variables

The Bayesian posterior mean estimates of the latent stochastic mortality factors in the models are shown in Figure 32, for $\kappa_t$ in the top panel and for $\gamma_t^0$ in the bottom panel. The colours of the lines denote the filtered processes under the different models. As expected, adding new state variables related to the factors significantly changes the dynamics of the period and cohort effect state processes. The blue line corresponds to the cohort-period only LCC model. The increase of $\kappa_t$ and the decrease of $\gamma_t^0$ at the end of the sample are greater for this model than for the other examined models.
The dynamics of the cohort effect latent process vector $\gamma_t$ over time, which is age-group specific, are shown in Figure 34. The panels correspond to the Bayesian posterior mean estimates of the process under the different models. The age-group-specific features utilised in the DFM-PC-B model clearly provide the LCC model with information supplementary to the cohort effect state processes. The corresponding $\varrho_t$ reduces the variability of the cohort effect process in comparison to the cohort effect estimated under the LCC model (the colours of the surface on the DFM-PC-B panel are plain and the variance of the error term is smaller). The state processes corresponding to the factors under the DFM-PC-B model are shown in Figure 33. Recall that, in contrast to the cohort and period effect processes, the latent process vector $\varrho_t$ under the model DFM-PC-B has age-group-specific stochastic components ($\kappa_t$ is calendar year specific and $\gamma_t$ is ‘0’ age group and calendar year specific). As the latent process $\kappa_t$ estimated under DFM-PC-B is not far from the corresponding process estimated under the LCC model for the majority of the sample, we can conclude that the additional state processes in this model provide supplementary, age-specific information to the cohort effect process $\gamma_t^0$.
The models DFM-PC-D-r, DFM-PC-D-s, DFM-PC-Mx-r and DFM-PC-Mx-s are characterized by processes which correspond to country-specific vectors with elements related to age groups. The models attempt to address the question of whether the structure of demographic data from European countries can be an efficient explanatory variable for British mortality. Figure 32 and Figure 34 show that the dynamics of the period and cohort effect state processes are sensitive to the set of features extracted from demographic data as well as to the standardisation methodology. Figure 33 and Figure 35 provide insight into how these variables change the information extracted from the models. This information highlights the impact of the first components of the considered countries on British mortality rates. As the components are orthonormal, $\varrho_t^{country}$ indicates the magnitude of this effect. To be consistent with the country-set-specific notation from Section 8, the blue vertical line divides the results into two categories: results for the developed countries (below the blue line) and developing countries (above the blue line). Processes which stay close to zero indicate that the data of the corresponding country have little influence on the mortality rates of the United Kingdom. When incorporating the demographic features from the European countries, the developed models indicate the significance of the factor loading from a given country on the mortality of the UK. Hence, we can identify the countries which have a positive effect on the mortality of the United Kingdom (factor state processes are negative), a neutral effect (factor state processes fluctuate around zero) or a negative effect (factor state processes are positive and increase log death rates).
For a given model, it is not true that all factors corresponding to the age-specific vectors of features from European countries influence, in a causal fashion, the UK mortality data in the same way. The plots in Figure 31 (or Figure 29 for the DFM-PC-B model) show that some countries have very wide posterior credible intervals (a flat posterior) for $\Omega$, which is in line with the findings in Figure 36 (Figure 33 for the DFM-PC-B model), where we see that for those countries $\varrho_t$ indeed indicates an insignificant effect on UK log mortality rates. Let us consider the example of DFM-PC-Mx-s (the red lines on both plots) for Belgium (BEL) and Austria (AUT). We see in Figure 36 that the effect of the Austrian factor on the British mortality data, expressed by the dynamics of $\varrho^{AUT}$, is non-zero over time. On the other hand, the element of $\varrho_t$ corresponding to Belgium, labelled $\varrho^{BEL}$, does not load significantly in any way on the UK mortality experience. As a consequence, the posterior of $\Omega$ for this country is also very flat, whereas the credible intervals of $\Omega^{AUT}$ for Austria are significantly narrower. This simply means that the factor loading of Belgium does not influence the UK mortality experience.
Hence, to conclude this discussion, we note that the results in Figure 36 show the effect of each individual country on the UK mortality experience. What we learn is that some countries have a mean of 0 with large uncertainty; these countries may be interpreted as not having an influence on the mortality experience of the UK.
The four models are more consistent about the set of countries which do not have any effect on the log death rates of United Kingdom Females. The models corresponding to the non-robust standardisation indicate a bigger impact of Western European countries, whereas their robust alternatives indicate the significance of the patterns from Eastern and Central European countries such as Lithuania, Poland or Russia.

9.5. In-Sample and Out-of-Sample Performance

In this section, we investigate the in-sample and out-of-sample performance of the mortality models summarized in the introduction to this section. The model selection is based on two in-sample performance measures, MSE (Mean Square Error) and DIC (Deviance Information Criterion), whereas the forecasting performance is examined based on MSEP (Mean Square Error of Prediction) using two forecasting distributions of log death rates: the one obtained by the Gibbs sampler and the one provided by the Kalman Filter.

9.5.1. Model Selection

The deviance information criterion is a popular measure in the Bayesian setting which trades off model fit against model complexity (the effective number of parameters), as introduced in (Spiegelhalter et al. 2002). Among the various versions of DIC, we use the so-called conditional DIC, which treats the latent states as parameters when calculating the conditional log-likelihood; for details please refer to (Celeux et al. 2006). The conditional log-likelihood is given by
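$$\log L_{y_t \mid \varphi_t, \psi}\left(\psi, \varphi_{1:N}; y_{1:N}\right) = -\frac{1}{2} \sum_{x=x_1}^{x_p} \sum_{t=1}^{N} \left[ \log\left(2\pi\sigma_\epsilon^2\right) + \frac{\left(y_{x,t} - \alpha_x - [\tilde{B}]_{x,\cdot}\,\varphi_t\right)^2}{\sigma_\epsilon^2} \right]$$
By denoting $\boldsymbol{\psi} = (\varphi_t, \psi)$ we define the deviance of the model as follows
$$D(\boldsymbol{\psi}) := -2 \log L_{y_t \mid \boldsymbol{\psi}}\left(\boldsymbol{\psi}; y_{1:N}\right).$$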
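In its general form the deviance also contains a standardising term $2 \log h(y_{1:N})$. As the function $h$ is independent of the model specification, it is usually set equal to 1, so that this term vanishes. Hence, the effective number of parameters is defined as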
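$$p_D := \bar{D}(\boldsymbol{\psi}) - D(\bar{\boldsymbol{\psi}})$$
where $\bar{D}(\boldsymbol{\psi})$ is the mean of the deviance over the samples of the vector $\boldsymbol{\psi}$ and $D(\bar{\boldsymbol{\psi}})$ is the deviance at the posterior mean of the vector of parameters $\boldsymbol{\psi}$. The DIC is then defined as
$$DIC := \bar{D}(\boldsymbol{\psi}) + p_D = 2\bar{D}(\boldsymbol{\psi}) - D(\bar{\boldsymbol{\psi}})$$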
which can be calculated using the MCMC samples.
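The calculation from MCMC samples can be sketched as follows. The sketch assumes independent Gaussian observation errors with a common variance, as in the conditional log-likelihood above, and that the fitted values (level plus state contribution) have already been computed for each draw. The function names and the two-draw toy example are illustrative assumptions.

```python
import math


def gaussian_deviance(y, fitted, sigma2):
    """D = -2 log L for independent Gaussian observations with common variance sigma2."""
    sq = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return len(y) * math.log(2.0 * math.pi * sigma2) + sq / sigma2


def dic_from_samples(deviances, deviance_at_mean):
    """Return (DIC, p_D) from per-draw deviances and the deviance at the posterior mean."""
    mean_dev = sum(deviances) / len(deviances)
    p_d = mean_dev - deviance_at_mean
    return mean_dev + p_d, p_d


# Toy example: two observations and two posterior draws of a common mean.
y_obs = [0.0, 2.0]
draws = [{"mu": 0.0, "sigma2": 1.0}, {"mu": 2.0, "sigma2": 1.0}]

# Deviance evaluated at every draw.
deviances = [
    gaussian_deviance(y_obs, [d["mu"]] * len(y_obs), d["sigma2"]) for d in draws
]

# Deviance evaluated once at the posterior mean of the parameters.
mu_bar = sum(d["mu"] for d in draws) / len(draws)
s2_bar = sum(d["sigma2"] for d in draws) / len(draws)
dev_at_mean = gaussian_deviance(y_obs, [mu_bar] * len(y_obs), s2_bar)

dic, p_d = dic_from_samples(deviances, dev_at_mean)
```

In the conditional version used here, the latent states enter the fitted values in the same way as static parameters, so they also contribute to the effective number of parameters.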
In addition to DIC, we calculate the mean square error (MSE) for the considered models, defined as the mean of the squared difference between the observed data, $y_t$, and the mean of the in-sample one-step-ahead model forecast given by the Kalman Filter, $f_t = E\left[y_t \mid \psi, y_{1:(t-1)}\right]$, given by Equation (A3b). Therefore, we define $e_t := y_t - f_t$ and
$$\text{MSE}(\bar{\psi}) := E\left[e_t e_t^T\right]$$
where for the point estimator of the vector of static parameters ψ we use the vector of posterior means.
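The one-step-ahead forecast errors can be illustrated with a scalar Kalman filter. The sketch below assumes a toy model with a single latent state and a single observed series, which is a strong simplification of the multivariate models in the paper. All parameter names and values are illustrative assumptions.

```python
def kalman_one_step_forecasts(y, a, b, lam, theta, q, r, m0, p0):
    """Scalar Kalman filter returning the one-step-ahead forecasts f_t.

    Assumed toy model (not the paper's multivariate specification):
        state:        phi_t = lam * phi_{t-1} + theta + w_t,  w_t ~ N(0, q)
        observation:  y_t   = a + b * phi_t + eps_t,          eps_t ~ N(0, r)
    """
    m, p = m0, p0
    forecasts = []
    for obs in y:
        # Prediction step for the state mean and variance.
        m_pred = lam * m + theta
        p_pred = lam * lam * p + q
        # One-step-ahead forecast of the observation and its variance.
        f = a + b * m_pred
        s = b * b * p_pred + r
        forecasts.append(f)
        # Update step using the forecast error.
        gain = p_pred * b / s
        m = m_pred + gain * (obs - f)
        p = (1.0 - gain * b) * p_pred
    return forecasts


def mean_square_error(y, forecasts):
    """Average squared one-step-ahead forecast error e_t = y_t - f_t."""
    errors = [obs - f for obs, f in zip(y, forecasts)]
    return sum(e * e for e in errors) / len(errors)


# Toy series with a downward drift, loosely mimicking a log death rate.
y_toy = [-4.00, -4.05, -4.12, -4.15, -4.22]
f_toy = kalman_one_step_forecasts(
    y_toy, a=0.0, b=1.0, lam=1.0, theta=-0.05, q=0.01, r=0.01, m0=-3.95, p0=1.0
)
mse_toy = mean_square_error(y_toy, f_toy)
```

In the multivariate case the error $e_t$ is a vector over age groups and the scalar square is replaced by the outer product used in the definition above.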

9.5.2. Forecasting Distribution

The Bayesian state-space framework allows us to obtain the forecasting distributions using MCMC samples given by
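$$\pi\left(y_{T+m} \mid y_{1:T}\right) = \int \pi\left(y_{T+m} \mid \varphi_{T+m}, \psi\right) \pi\left(\varphi_{T+m} \mid \varphi_{T+m-1}, \psi\right) \pi\left(\varphi_T, \psi \mid y_{1:T}\right) \, d\psi \, d\varphi_{T:T+m}$$
where $\psi$ is the static parameter vector and $\varphi_t$ is the state process vector. Following [104], by sampling recursively, we obtain the following forecasting distributions, where the superscript $(i)$ denotes the $i$-th sample
$$\varphi_{T+k}^{(i)} \sim N\left(\tilde{\Lambda}^{(i)} \varphi_{T+k-1}^{(i)} + \tilde{\Theta}^{(i)}, \; \tilde{\Psi}^{(i)}\right), \qquad y_{T+k}^{(i)} \sim N\left(\alpha^{(i)} + \tilde{B}_t^{(i)} \varphi_{T+k-1}^{(i)} + \tilde{\Theta}^{(i)}, \; \sigma_\epsilon^{2\,(i)} I_d\right).$$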
Alternatively, we can use the forecasting distribution given by the Kalman Filter, that is
$$\varphi_{T+k} \sim N\left(\tilde{\Lambda} \varphi_{T+k-1} + \tilde{\Theta}, \; \tilde{\Psi}\right), \qquad y_{T+k} \sim N\left(\alpha + \tilde{B}_t \varphi_{T+k-1} + \tilde{\Theta}, \; \sigma_\epsilon^2 I_d\right)$$
for the static parameters, which have been estimated by averaging the realisations sampled by the Gibbs sampler. Let us define the mean square error of prediction as follows
$$\text{MSEP}(\psi) := E\left[\left(y_{T+k} - E\left[y_{T+k} \mid y_{1:T}, \psi\right]\right)\left(y_{T+k} - E\left[y_{T+k} \mid y_{1:T}, \psi\right]\right)^T\right].$$
Therefore, the mean square error of prediction using the MCMC distribution is calculated as the mean of $\text{MSEP}(\psi)$ over the samples of the vector $\psi$ and is denoted by $\text{MSEP}_{MCMC}$. The mean square error of prediction using the distribution provided by the Kalman Filter is calculated at the posterior mean of the vector of parameters $\psi$.
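The difference between the two prediction error measures can be sketched with a scalar toy model. The sketch assumes a single latent state with transition mean $\lambda \varphi_{t-1} + \theta$ and an observation mean $\alpha + b\,\varphi_{T+k}$, and it treats the last filtered state as known within each draw. These modelling choices, the function names and the numbers are illustrative assumptions rather than the specification used in the paper.

```python
def forecast_means(draw, horizon):
    """Conditional forecast means E[y_{T+k} | y_{1:T}, psi] for one parameter draw."""
    phi = draw["phi_T"]
    means = []
    for _ in range(horizon):
        phi = draw["lam"] * phi + draw["theta"]          # state mean recursion
        means.append(draw["alpha"] + draw["b"] * phi)    # observation mean
    return means


def msep(draw, held_out):
    """Mean squared prediction error over the horizon for one parameter draw."""
    means = forecast_means(draw, len(held_out))
    return sum((y - m) ** 2 for y, m in zip(held_out, means)) / len(held_out)


def msep_mcmc(draws, held_out):
    """Average of the per-draw MSEP over all posterior draws."""
    return sum(msep(d, held_out) for d in draws) / len(draws)


def msep_at_posterior_mean(draws, held_out):
    """MSEP evaluated once, at the posterior mean of the parameters."""
    mean_draw = {k: sum(d[k] for d in draws) / len(draws) for k in draws[0]}
    return msep(mean_draw, held_out)


# Two toy posterior draws that differ only in the drift of the state.
draws_toy = [
    {"lam": 1.0, "theta": -0.1, "alpha": -4.0, "b": 1.0, "phi_T": 0.0},
    {"lam": 1.0, "theta": -0.3, "alpha": -4.0, "b": 1.0, "phi_T": 0.0},
]
held_out_toy = [-4.2, -4.4]

err_mcmc = msep_mcmc(draws_toy, held_out_toy)
err_mean = msep_at_posterior_mean(draws_toy, held_out_toy)
```

In this toy case averaging over draws gives a larger error than evaluating at the posterior mean, because the spread of the drift parameter across draws enters the first measure but not the second.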

9.5.3. Comparison of the Models

For the out-of-sample study we use the last 10 years of the available sample of British female death rates. The calibration period is 1922–2002. Table 4 summarizes the mean squared errors of the estimated observations using the Kalman Filter (MSE), the deviance information criterion (DIC) and the mean square errors of prediction using the MCMC distribution ($\text{MSEP}_{MCMC}$) and the Kalman Filter distribution ($\text{MSEP}_{Kalman}$). The results confirm that adding the features extracted from demographic data as additional explanatory variables to the LCC model improves both the in-sample and the out-of-sample fit, and therefore the predictability of log death rates. The plots with age-group-specific prediction results can be found in Figure 37.
For the in-sample performance, MSE and DIC agree on the group of the two best performing models, DFM-PC-B and DFM-PC-Mx-r, but disagree on the group of the two worst performing models. Because it assesses the performance of a model while accounting for its complexity, DIC more successfully identifies the models with the poorest out-of-sample performance, LCC and DFM-PC-Mx-s respectively. It is especially worth noting the significant over-fitting of the LCC model, which is further investigated in Section 10. Note that using MSE alone as the model selection criterion would not be sufficient to choose a model with good performance. In terms of MSE, the in-sample performance of the LCC model is comparable to that of the DFM-PC-B model, while DIC labels it as the model with the worst explanatory power.

10. Additional Remarks on Modelling and Forecasting Results

While conducting the study we encountered two issues which are worth separate discussion: the influence of the stratification on the class of Lee-Carter models, and the intuition behind the vector $\alpha$ in the model Equation (8) when we incorporate the demographic features.

10.1. The Effect of the Stratification and Identification Constraints on Estimation of Stochastic Lee-Carter Type Models

We draw the reader's attention at this point to the substantially lower predictive accuracy of the Lee-Carter cohort (LCC) model in comparison to the models which employ demographic factors, as shown in Table 4. This is especially important since such LCC models without factors have previously been documented to have better out-of-sample performance for UK data when no age stratification is applied.
We now explain the steps we have taken to explore this feature. Firstly, we clarify that this is not a problem with the sampler or the prior specification; rather, it is related to the suitability of different model structures under particular model assumptions. We explore this specifically with respect to age group stratification and its influence on the model fit and performance.
Note that the stratification was adopted, looking at 20 sets of age groups in 5-year buckets, to reduce the dimensionality of the model. This can of course influence the model fit, the assumptions made regarding model simplification and the cohort interpretation. Therefore, we decided to include additional studies to investigate these effects more carefully. Our attention has been drawn to two points: the rapid increase of $\kappa_t$ in 2000 and the substantially lower predictive accuracy of the Lee-Carter cohort (LCC) model in comparison to the models which employ demographic factors, as shown in Table 4. Since the LCC model has been documented to have better out-of-sample explanatory power for the UK data when there is no stratification, we undertook an additional investigation. In particular, we explore in more detail the effect of age stratification and the appropriate adjustment of model assumptions and identification constraints needed in order to compare models in a meaningful manner. The details are described below. The models which we analyse are denoted as follows
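LC: 
Lee-Carter model
$$\log m_{x,t} = \alpha_x + \beta_x \kappa_t,$$
with the constraints
$$\sum_x \beta_x = 1, \quad \sum_t \kappa_t = 0,$$
LCC: 
Simplified Lee-Carter cohort model
$$\log m_{x,t} = \alpha_x + \beta_x \kappa_t + \gamma_{t-x},$$
with the constraints
$$\sum_x \beta_x = 1, \quad \sum_t \kappa_t = 0, \quad \sum_{c = t_1 - x_p}^{t_N - x_1} \gamma_c = 0.$$
LCCF: 
Lee-Carter full cohort model
$$\log m_{x,t} = \alpha_x + \beta_x \kappa_t + \beta_x^{\gamma} \gamma_{t-x},$$
with the constraints
$$\sum_x \beta_x = 1, \quad \sum_x \beta_x^{\gamma} = 1, \quad \sum_t \kappa_t = 0, \quad \sum_{c = t_1 - x_p}^{t_N - x_1} \gamma_c = 0.$$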
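For the basic LC model, the two identification constraints can be imposed on any set of parameters without changing the fitted log death rates. The following minimal sketch applies the standard reparameterisation (shift the period effect to mean zero, rescale the age loadings to sum to one and compensate in the level). It covers only the LC case without the cohort term, assumes that the loadings do not sum to zero, and uses a hypothetical function name and toy values.

```python
def impose_lc_constraints(alpha, beta, kappa):
    """Reparameterise (alpha, beta, kappa) so that sum(beta) = 1 and sum(kappa) = 0.

    The fitted values alpha_x + beta_x * kappa_t are left unchanged.
    Assumes sum(beta) is non-zero.
    """
    s = sum(beta)
    c = sum(kappa) / len(kappa)
    alpha_new = [a + b * c for a, b in zip(alpha, beta)]   # absorb the mean of kappa
    beta_new = [b / s for b in beta]                       # loadings sum to one
    kappa_new = [s * (k - c) for k in kappa]               # centred and rescaled
    return alpha_new, beta_new, kappa_new


def fitted(alpha, beta, kappa):
    """Fitted log death rates, one row per age group and one column per year."""
    return [[a + b * k for k in kappa] for a, b in zip(alpha, beta)]


# Toy parameters: 2 age groups and 3 calendar years.
alpha0, beta0, kappa0 = [-3.0, -2.0], [0.5, 1.5], [1.0, 2.0, 3.0]
alpha1, beta1, kappa1 = impose_lc_constraints(alpha0, beta0, kappa0)
```

The same fitted surface is therefore compatible with infinitely many parameter values, which is why the magnitude of the loadings and of the period effect depends on the chosen constraints.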
In this study we consider two age group stratifications: the $1 \times 1$ study, which has 100 age groups per year, and the more parsimonious class of models given by the $5 \times 1$ age group stratification, with 21 age groups. Since the number of age groups differs between stratified and non-stratified mortality data, we expect that the parameters and latent variables estimated under the above constraints may differ in magnitude. Therefore, to ensure comparability of the results when we examine the stratification effect on the family of Lee-Carter models, one must standardise the magnitude of the parameters and variables of the models between the stratified and non-stratified cases. We demonstrate that this can be achieved via a simple scaling adjustment to the identification constraints. Hence, we introduce the scaling parameter $a > 0$. The models with the imposed adjustment are denoted by the lower index $adj$ as follows
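LC$_{adj}$: 
Lee-Carter model
$$\log m_{x,t} = \alpha_x + \beta_x \kappa_t,$$
with the constraints
$$\sum_x \beta_x = \frac{1}{a}, \quad \sum_t \kappa_t = 0,$$
LCC$_{adj}$: 
Simplified Lee-Carter cohort model
$$\log m_{x,t} = \alpha_x + \beta_x \kappa_t + \gamma_{t-x},$$
with the constraints
$$\sum_x \beta_x = \frac{1}{a}, \quad \sum_t \kappa_t = 0, \quad \sum_{c = t_1 - x_p}^{t_N - x_1} \gamma_c = 0.$$
LCCF$_{adj}$: 
Lee-Carter full cohort model
$$\log m_{x,t} = \alpha_x + \beta_x \kappa_t + \beta_x^{\gamma} \gamma_{t-x},$$
with the constraints
$$\sum_x \beta_x = \frac{1}{a}, \quad \sum_x \beta_x^{\gamma} = \frac{1}{a}, \quad \sum_t \kappa_t = 0, \quad \sum_{c = t_1 - x_p}^{t_N - x_1} \gamma_c = 0.$$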
Lastly, in order to distinguish between the results of the models for stratified and non-stratified data, we denote the models for stratified data with the lower index “$5 \times 1$” and those for non-stratified data with “$1 \times 1$”. For instance, the results for the Lee-Carter model LC for stratified data are denoted by LC$_{5 \times 1}$ and the results for the same model for data without any stratification are denoted by LC$_{1 \times 1}$.

10.1.1. The Estimates of the Static Parameters and Filtered Latent Variables

The list of models examined and discussed in this subsection is given in the first column of Table 5. In addition to the LC, LCC and LCCF models, we also include in the comparison the models with adjusted constraints for the mortality data in the “$5 \times 1$” format, with the adjustment parameter $a = \frac{\#\,\text{age groups}_{1 \times 1}}{\#\,\text{age groups}_{5 \times 1}} = \frac{100}{21} \approx 4.762$ being the ratio between the number of age groups in the $1 \times 1$ stratification and in the $5 \times 1$ stratification of the mortality data.
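The effect of the adjustment parameter can be checked numerically. The following minimal sketch starts from loadings that sum to one over 21 age groups and rescales them so that they sum to $1/a$, compensating in the period effect so that every product $\beta_x \kappa_t$ is unchanged. The function name, the equal loadings and the three-year period effect are illustrative assumptions.

```python
def rescale_for_stratification(beta, kappa, a):
    """Rescale so that sum(beta) is divided by a while beta_x * kappa_t is unchanged."""
    return [b / a for b in beta], [a * k for k in kappa]


# Ratio of the number of age groups in the 1 x 1 and 5 x 1 formats.
a = 100 / 21

# Toy 5 x 1 parameters satisfying sum(beta) = 1 and sum(kappa) = 0.
beta_5x1 = [1 / 21] * 21
kappa_5x1 = [-1.0, 0.0, 1.0]

beta_adj, kappa_adj = rescale_for_stratification(beta_5x1, kappa_5x1, a)

# Largest change in any product beta_x * kappa_t caused by the rescaling.
max_change = max(
    abs(b0 * k0 - b1 * k1)
    for b0, b1 in zip(beta_5x1, beta_adj)
    for k0, k1 in zip(kappa_5x1, kappa_adj)
)
```

With equal loadings, each adjusted loading equals 1/100, which is the average loading under the unit-sum constraint with 100 age groups, so the adjusted 5 × 1 parameters are on the same scale as the 1 × 1 parameters.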
The Bayesian posterior estimators of the static parameters α x and β across age groups are shown in Figure 38 when the Bayesian posterior mean estimates of period effect κ t and cohort effect state process γ t 0 are shown in Figure 39.
The first straightforward remark on the investigation is to note that there is an inconsequential influence of both the stratification and the adjustment to the identification constraint for stratification, when investigating the estimation of the level vector of the model, as denoted by parameter vector α .
Also, the basic Lee-Carter model is not affected by the stratification, as both its in-sample and out-of-sample quality of fit are comparable between data in the “5 × 1” and “1 × 1” formats. However, the estimates of $\beta$ for $\text{LC}_{5\times1}$ are greater in magnitude than those for the $\text{LC}_{1\times1}$ model, which appears to be offset by a smaller slope of the filtered $\kappa_t$. Importantly, when the adjustment for the number of age groups is imposed, the $\beta$ and $\kappa_t$ for $\text{LC}_{5\times1,\text{adj}}$ are in line with those of $\text{LC}_{1\times1}$, and the model retains comparable explanatory power. This observation suggests that the stratification mainly influences the cohort effect, which is intuitive given the interplay between age stratification and the cohort effect.
The results in Table 5 show that the out-of-sample quality of fit for the $\text{LCC}_{5\times1}$ model is significantly lower than the corresponding result for $\text{LCC}_{1\times1}$. Since the adjusted model, $\text{LCC}_{5\times1,\text{adj}}$, produces similar outcomes, the discrepancy in the out-of-sample quality of fit between the LCC model applied to “5 × 1” and “1 × 1” data is not caused by the smaller number of age groups and the resulting different magnitude of the estimates of $\beta$ and $\kappa_t$ (recall Figure 38 and Figure 39). Since the discrepancy in in-sample explanatory power between the models is smaller, we conclude that the LCC model tends to over-fit when applied to stratified data.
We begin the discussion of these results by noting the following finding. There appears to be an interplay between model parsimony and the bias and variance in the results for both in-sample fits and out-of-sample forecasts, as reflected by the Mean Squared Error (MSE) results, which are affected more by the model structure than by stratification effects.
For instance, the more parsimonious model choices, corresponding to the three models of the LC sub-family, always had a larger MSE than the less parsimonious class of simplified LCC models. That is, the in-sample MSE improved by around an order of magnitude when we incorporated the extra structure corresponding to the cohort feature. This was not influenced by the age stratification performed. We conjecture that, although the LC models potentially have lower variance because fewer model parameters need to be estimated, their in-sample MSE is generally still worse because of the increased bias that may arise from not sufficiently capturing the stochastic structure of the data.
Furthermore, we also see a pronounced effect of stratification on the out-of-sample forecast performance of the simplified class of LCC models in which no adjustment was made for the stratification effect. This indicates that the adjustment we propose to use when undertaking age-group stratification can substantially reduce the bias in the resulting model estimates when we compare between the simplified LCC 5 × 1 model and the adjusted form.
Thirdly, we observe that the most flexible class of LCC model, the non-simplified LCCF class of models, was significantly affected by removing the 5 × 1 age stratification, that is, by moving to the 1 × 1 case. To understand this, note that the dimension of the model parameters to be estimated is significantly larger in $\text{LCCF}_{1\times1}$ than in $\text{LCCF}_{5\times1}$. We believe this produces poor in-sample MSE and out-of-sample MSEP because of the resulting over-fitting and increased variance in the model estimates, compared with the simplified LCC model equivalents. Importantly, however, the stratification effect is significant here: the reduction in the dimension of the model parameters in $\text{LCCF}_{5\times1}$ relative to $\text{LCCF}_{1\times1}$ reduces the variance of the in-sample and out-of-sample mortality estimates, while the additional degrees of freedom also reduce the bias that arises from the constrained $\text{LCC}_{5\times1}$ model, resulting in the best MSE and MSEP performance.

10.2. The Estimates of the Intercept for the Factor Model DFM-PC-D-r

The following study addresses the interpretation of the estimates of $\alpha$ under the new class of stochastic mortality factor models, in comparison with the standard Lee-Carter models without the matrix of demographic factors. Please refer to Section 10.2. The argument we propose for interpreting $\alpha$ is adjusted to the fact that, in contrast to the standard Lee-Carter model, our model incorporates exogenous factors. As such, we argue that the interpretation of the intercept should now incorporate both $\alpha$, the classical intercept, and the term $\tilde{F}_t \varrho_t$, corresponding to the intercept which arises from the exogenous factors. In the time series context this is considered a stochastic intercept. Hence the interpretation of $\alpha$ typically adopted in the classical stochastic Lee-Carter type period-cohort models does not hold under the new model, since we have now incorporated the additional structure corresponding to the regression term from the demographic factors. The expression $\tilde{F}_t \varrho_t$, which is added to the observation equation, provides time-varying supplementary information to the static level given by $\alpha$. To validate this claim we have undertaken additional studies which demonstrate that, when we combine $\alpha$ with this component of the model, the posterior mean of the combined quantity behaves in a fashion analogous to what one would expect of the posterior mean of $\alpha$ in the standard, non-factor class of Lee-Carter models. This is interesting as it shows the influence of the factors and the additional interpretation of the level contributed by the long-term exogenous demographic factors. The plot in Figure 40 shows the Bayesian posterior mean estimates, with 95% credible intervals, of $\alpha + \tilde{F}_t \varrho_t^{\top}$ averaged over time, whereas Figure 41 illustrates the posterior mean over time. The level of the expression in both plots is below zero and behaves in the fashion we would expect from $\alpha$ in standard Lee-Carter models.
This confirms our interpretation.
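As an illustration of the quantity discussed in this subsection, the following is a minimal sketch of how the time-averaged level $\alpha + \tilde{F}_t \varrho_t$ and its 95% credible interval could be summarised from MCMC output. The function name, the array names (`alpha_draws`, `rho_draws`, `F`) and their shapes are assumptions made for the example and are not taken from the implementation used in the paper.

```python
import numpy as np

def level_summary(alpha_draws, rho_draws, F):
    """Summarise alpha + F_t rho_t averaged over time.

    alpha_draws: (S, p) posterior draws of alpha (hypothetical storage layout)
    rho_draws:   (S, T, m) posterior draws of the factor states rho_t
    F:           (T, p, m) factor matrices F_t, treated as known
    """
    # Factor contribution F_t rho_t for every draw s and time t: shape (S, T, p)
    contrib = np.einsum("tpm,stm->stp", F, rho_draws)
    # Average the stochastic intercept over time within each draw: shape (S, p)
    level = alpha_draws + contrib.mean(axis=1)
    # Equal-tailed 95% credible interval per age group
    lower, upper = np.percentile(level, [2.5, 97.5], axis=0)
    return level.mean(axis=0), lower, upper
```

Averaging over time inside each draw, and only then taking quantiles across draws, keeps the interval a statement about the posterior of the averaged level rather than about individual years.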

11. Conclusions

We developed and presented a comprehensive study focusing on the analysis of demographic data and its incorporation into a state-space framework for stochastic mortality modelling. We extended the well-known Lee-Carter model with a stochastic cohort effect by introducing new state processes which correspond to the age-specific dynamics of European female log death rates.
Using Probabilistic Principal Component Analysis, we showed how to extract meaningful features from the demographic data of European countries and apply them as explanatory variables in mortality estimation and forecasting. In the presence of short time series and different types of missingness, the suggested methodology aims to be as parsimonious as possible. By analysing the extracted features, we found further evidence of region-specific mortality structures. The features also exhibit significant sensitivity to the method used to estimate the moments. As outlined in Section 8, the robust alternatives to the sample estimators provide more consistent results regardless of the number of missing entries, especially if the data have not been preprocessed or smoothed.
The results of applying the studied models to British female log mortality data showed that incorporating the features extracted from European demographic data provides valuable information about the mortality forces which affect the British population. Also, the models with dynamic factors exhibit better in-sample and out-of-sample fit than the Lee-Carter model with the cohort effect.
As an additional outcome of the study, we analysed the effect of data stratification on the family of Lee-Carter models. In Section 10 we argue that the stratification mainly influences the cohort effect process and requires more modelling flexibility than the simplified Lee-Carter cohort model provides. The investigation showed that the standard Lee-Carter model performs similarly for stratified and non-stratified data, whereas the simplified Lee-Carter cohort model is prone to significant overfitting when fitted to stratified data. The investigation also shows that stratification helps to resolve the overfitting issues of the full Lee-Carter cohort model, owing to the smaller dimensionality of the observation vector.
There are a few ways in which the paper can be extended. First, with appropriate data it would be straightforward to conduct a similar analysis for an extended data set, for example with gender- and region-specific disaggregation of mortality and demographic data and with the inclusion of various population-related factors such as cause of death (Murray and Lopez 1997; Girosi and King 2008; and more recently Gaille and Sherris 2015), midlife conditions as in Gavrilov and Gavrilova (2015), or migration. Secondly, considering different prior distributional assumptions and introducing a methodology to handle more advanced patterns of missingness would yield better explanatory power and more consistent interpretability. We can also improve the feature extraction framework to account for fat-tailed distributions by means of different robustification methodologies, or through extensions of Principal Component Analysis such as Independent Component Analysis and their functional alternatives, as in Shang and Hyndman (2016).

Acknowledgments

Dorota Toczydlowska and Gareth W. Peters would like to acknowledge the generous support of the Institute of Statistical Mathematics in Tokyo, Japan for providing the opportunity to visit, present and get feedback on aspects of this work. Pavel Shevchenko acknowledges support from the Australian Research Council’s Discovery Projects funding scheme (project number: DP160103489).

Author Contributions

Dorota Toczydlowska was the main author of the manuscript and the methodological development and implementations. The co-authors Gareth W. Peters, Man Chung Fung and Pavel V. Shevchenko contributed to aspects of the methodological developments, the derivations, implementations and data analysis.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Bayesian Modelling and Sampling of Demographic Factor Model Extension to the Period-Cohort Stochastic Mortality State-Space Models

In this appendix section we explain the Bayesian models developed for the estimation of the demographic factor model extension to the period-cohort stochastic mortality state-space models that are applied in this paper. These are based on the frameworks detailed and developed in Fung et al. (2016) and Fung et al. (2017).

Appendix A.1. Bayesian Model Development and Inference for Stochastic Mortality Models

The observation and state equations, Equations (5a) and (5b), imply that the cohort model formulated here belongs to the linear-Gaussian class of state-space models. As a result, one can perform efficient maximum-likelihood or Bayesian estimation when fitting the model to data, as discussed in detail in Fung et al. (2016). In this paper we focus on Bayesian inference so that the forecasting distribution can take parameter uncertainty into account.
To achieve these inference goals we must first develop the Bayesian models. In this section we detail the Bayesian estimation of the cohort model, Equations (5a) and (5b), and of its extensions that incorporate population information into the state-space formulation, Equations (8a) and (8b). Because the models belong to the linear and Gaussian class of state-space models, one can apply an efficient MCMC estimation algorithm based on Gibbs sampling with conjugate priors combined with forward-backward filtering, as described in Fung et al. (2016). We describe the procedures using the notation of the general cohort model, Equations (5a) and (5b), and indicate the differences for the extensions considered.
We borrow the notation from the LC cohort model, Equations (5a) and (5b), and indicate similarities and differences between Equations (5a) and (5b) and Equations (8a) and (8b) while developing the estimation algorithm. In the general setting, our target density is
$$\pi(\varphi_{0:T}, \psi \mid y_{1:T}),$$
where $\varphi_{0:T}$ is a vector of latent variables and $\psi$ is a vector of static parameters. In the case of the cohort LC model, $\varphi_{0:T} := (\kappa_{0:T}, \gamma^{x_1}_{0:T}, \dots, \gamma^{x_p}_{0:T})$ is the $(p+1)$-dimensional (for each $t$) latent state vector and $\psi := (\alpha_{x_3:x_p}, \beta_{x_2:x_p}, \theta, \eta, \lambda, \sigma^2_\varepsilon, \sigma^2_\kappa, \sigma^2_\gamma)$ is the $(2p-1)$-dimensional static parameter vector. For the extended models, we add the vector of global-factor latent variables $\varrho_t$ and its static model parameters. Recall that our proposed identification constraint is given by Equation (6); therefore only $\alpha_{x_3:x_p}$ and $\beta_{x_2:x_p}$ need to be estimated. We perform block sampling of the latent state via the so-called forward-filtering-backward-sampling (FFBS) algorithm (Carter and Kohn 1994), and the posterior samples of the static parameters are obtained via conjugate priors. The sampling procedure is described in Algorithm 2, where $N$ is the number of MCMC iterations performed.
Algorithm 2 MCMC sampling for $\pi(\varphi_{0:T}, \psi \mid y_{1:T})$
1: Initialise: $\psi = \psi^{(0)}$.
2: for $i = 1, \dots, N$ do
3:   Sample $\varphi_{0:T}^{(i)}$ from $\pi(\varphi_{0:T} \mid \psi^{(i-1)}, y_{1:T})$ via FFBS (Appendix A.2).
4:   for $h = 1, \dots, 2p-1$ do
5:     Sample $\psi_h^{(i)}$ from $\pi(\psi_h \mid \varphi_{0:T}^{(i)}, \psi_{-h}^{(i)}, y_{1:T})$ (Appendix A.3),
6:     where $\psi_{-h}^{(i)} = (\psi_1^{(i)}, \dots, \psi_{h-1}^{(i)}, \psi_{h+1}^{(i-1)}, \dots, \psi_{2p-1}^{(i-1)})$.
7:   end for
8: end for
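The loop structure of Algorithm 2 can be sketched generically as follows. This is only an illustrative skeleton under stated assumptions: the function `gibbs`, the callback signatures `sample_states(psi, y, rng)` and `sampler(phi, psi, y, rng)`, and the dictionary-based storage of $\psi$ are hypothetical conventions chosen for the example, not the authors' code.

```python
import numpy as np

def gibbs(y, psi_init, sample_states, param_samplers, n_iter, rng):
    """Generic Gibbs loop in the spirit of Algorithm 2.

    sample_states(psi, y, rng) -> latent path phi_{0:T} (the FFBS step)
    param_samplers: dict mapping a parameter name to a function
                    f(phi, psi, y, rng) that draws from its full conditional
    """
    psi = dict(psi_init)
    chain = []
    for _ in range(n_iter):
        # Step 3: block-sample the latent states given the previous psi
        phi = sample_states(psi, y, rng)
        for name, sampler in param_samplers.items():
            # Step 5: psi holds iteration-i values for parameters already
            # updated in this sweep and iteration-(i-1) values for the rest
            psi[name] = sampler(phi, psi, y, rng)
        chain.append(dict(psi))
    return chain
```

Keeping the state sampler and the per-parameter samplers as callbacks means the same loop can serve the cohort model and its extensions, with only the callbacks changing.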

Appendix A.2. Forward-Backward Filtering for Latent State Dynamics

The FFBS procedure requires carrying out multivariate Kalman filtering forward in time and then sampling backward using the resulting filtering distributions. For the cohort model, Equations (2) and (3) (or Equations (5a) and (5b)), the conditional distributions involved in the multivariate Kalman filtering recursions are given by
$$\varphi_{t-1} \mid y_{1:t-1} \sim N(m_{t-1}, C_{t-1}),$$
$$\varphi_t \mid y_{1:t-1} \sim N(a_t, R_t),$$
$$y_t \mid y_{1:t-1} \sim N(f_t, Q_t),$$
$$\varphi_t \mid y_{1:t} \sim N(m_t, C_t),$$
where
$$a_t = \Lambda m_{t-1} + \Theta, \qquad R_t = \Lambda C_{t-1} \Lambda^{\top} + \Upsilon,$$
$$f_t = \alpha + B a_t, \qquad Q_t = B R_t B^{\top} + \sigma_\varepsilon^2 I_p,$$
$$m_t = a_t + R_t B^{\top} Q_t^{-1} (y_t - f_t), \qquad C_t = R_t - R_t B^{\top} Q_t^{-1} B R_t,$$
for $t = 1, \dots, T$. Since
$$\pi(\varphi_{0:T} \mid \psi, y_{1:T}) = \prod_{t=0}^{T} \pi(\varphi_t \mid \varphi_{t+1:T}, \psi, y_{1:T}) = \prod_{t=0}^{T} \pi(\varphi_t \mid \varphi_{t+1}, \psi, y_{1:t}),$$
we see that, for block sampling of the latent state, one can first draw $\varphi_T$ from $N(m_T, C_T)$ and then, for $t = T-1, \dots, 1, 0$ (that is, backward in time), draw a sample of $\varphi_t \mid \varphi_{t+1}, \psi, y_{1:T}$ recursively given a sample of $\varphi_{t+1}$. It turns out that $\varphi_t \mid \varphi_{t+1}, \psi, y_{1:T} \sim N(h_t, H_t)$, where
$$h_t = m_t + C_t \Lambda^{\top} R_{t+1}^{-1} (\varphi_{t+1} - a_{t+1}),$$
$$H_t = C_t - C_t \Lambda^{\top} R_{t+1}^{-1} \Lambda C_t,$$
based on Kalman smoothing (Carter and Kohn 1994). For the extended models, we simply need to replace the vectors and matrices of the LC cohort model with the corresponding objects marked by tildes in Section 3.
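The forward filtering and backward sampling recursions of this subsection can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function name `ffbs`, the argument names (`Lam`, `Ups` for the state transition and state noise covariance, `m0`, `C0` for the initial moments) and the use of time-invariant matrices are assumptions, and the sketch assumes that $Q_t$ and $R_{t+1}$ are invertible.

```python
import numpy as np

def ffbs(y, alpha, B, Lam, Theta, Ups, sigma2_eps, m0, C0, rng):
    """Draw one latent path phi_{0:T} given static parameters and data.

    y: (T, p) observations; B: (p, d); Lam, Ups, C0: (d, d); Theta, m0: (d,).
    """
    T, p = y.shape
    d = m0.shape[0]
    m = np.zeros((T + 1, d)); C = np.zeros((T + 1, d, d))
    a = np.zeros((T + 1, d)); R = np.zeros((T + 1, d, d))
    m[0], C[0] = m0, C0
    # Forward pass: Kalman filter
    for t in range(1, T + 1):
        a[t] = Lam @ m[t - 1] + Theta
        R[t] = Lam @ C[t - 1] @ Lam.T + Ups
        f = alpha + B @ a[t]
        Q = B @ R[t] @ B.T + sigma2_eps * np.eye(p)
        # Gain R_t B' Q_t^{-1}, using the symmetry of Q_t and R_t
        K = np.linalg.solve(Q, B @ R[t]).T
        m[t] = a[t] + K @ (y[t - 1] - f)
        C[t] = R[t] - K @ B @ R[t]
    # Backward pass: sample phi_T, then phi_t | phi_{t+1} for t = T-1, ..., 0
    phi = np.zeros((T + 1, d))
    phi[T] = rng.multivariate_normal(m[T], 0.5 * (C[T] + C[T].T))
    for t in range(T - 1, -1, -1):
        # Smoother gain C_t Lam' R_{t+1}^{-1}
        G = np.linalg.solve(R[t + 1], Lam @ C[t]).T
        h = m[t] + G @ (phi[t + 1] - a[t + 1])
        H = C[t] - G @ Lam @ C[t]
        phi[t] = rng.multivariate_normal(h, 0.5 * (H + H.T))
    return phi
```

The covariances are symmetrised before sampling to guard against rounding asymmetries; linear solves are used in place of explicit matrix inverses for numerical stability.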

Appendix A.3. Posteriors for Static Parameters in the Cohort Model

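To sample the posterior distribution of the static parameters, we assume the following independent conjugate priors:
$$\alpha_x \sim N(\tilde\mu_\alpha, \tilde\sigma_\alpha^2), \quad \beta_x \sim N(\tilde\mu_\beta, \tilde\sigma_\beta^2), \quad \theta \sim N(\tilde\mu_\theta, \tilde\sigma_\theta^2), \quad \eta \sim N(\tilde\mu_\eta, \tilde\sigma_\eta^2),$$
$$\lambda \sim N_{[-1,1]}(\tilde\mu_\lambda, \tilde\sigma_\lambda^2), \quad \sigma_\varepsilon^2 \sim \text{IG}(\tilde a_\varepsilon, \tilde b_\varepsilon), \quad \sigma_\kappa^2 \sim \text{IG}(\tilde a_\kappa, \tilde b_\kappa), \quad \sigma_\gamma^2 \sim \text{IG}(\tilde a_\gamma, \tilde b_\gamma),$$
where $N_{[-1,1]}$ denotes a truncated Gaussian with support $[-1, 1]$ and $\text{IG}(\tilde a, \tilde b)$ denotes an inverse-gamma distribution with mean $\tilde b / (\tilde a - 1)$ and variance $\tilde b^2 / ((\tilde a - 1)^2 (\tilde a - 2))$ for $\tilde a > 2$. The posteriors of the static parameters are then obtained as follows (see footnote 1):
$$\alpha_x \mid y, \varphi, \psi_{-\alpha_x} \sim N\!\left( \frac{\tilde\sigma_\alpha^2 \sum_{t=1}^{T} (y_{x,t} - \beta_x \kappa_t - \gamma_t^{x}) + \tilde\mu_\alpha \sigma_\varepsilon^2}{\tilde\sigma_\alpha^2 T + \sigma_\varepsilon^2},\; \frac{\tilde\sigma_\alpha^2 \sigma_\varepsilon^2}{\tilde\sigma_\alpha^2 T + \sigma_\varepsilon^2} \right),$$
$$\beta_x \mid y, \varphi, \psi_{-\beta_x} \sim N\!\left( \frac{\tilde\sigma_\beta^2 \sum_{t=1}^{T} \big(y_{x,t} - (\alpha_x + \gamma_t^{x})\big) \kappa_t + \tilde\mu_\beta \sigma_\varepsilon^2}{\tilde\sigma_\beta^2 \sum_{t=1}^{T} \kappa_t^2 + \sigma_\varepsilon^2},\; \frac{\tilde\sigma_\beta^2 \sigma_\varepsilon^2}{\tilde\sigma_\beta^2 \sum_{t=1}^{T} \kappa_t^2 + \sigma_\varepsilon^2} \right),$$
$$\theta \mid y, \varphi, \psi_{-\theta} \sim N\!\left( \frac{\tilde\sigma_\theta^2 \sum_{t=1}^{T} (\kappa_t - \kappa_{t-1}) + \tilde\mu_\theta \sigma_\kappa^2}{\tilde\sigma_\theta^2 T + \sigma_\kappa^2},\; \frac{\tilde\sigma_\theta^2 \sigma_\kappa^2}{\tilde\sigma_\theta^2 T + \sigma_\kappa^2} \right),$$
$$\eta \mid y, \varphi, \psi_{-\eta} \sim N\!\left( \frac{\tilde\sigma_\eta^2 \sum_{t=1}^{T} (\gamma_t^{x_1} - \lambda \gamma_{t-1}^{x_1}) + \tilde\mu_\eta \sigma_\gamma^2}{\tilde\sigma_\eta^2 T + \sigma_\gamma^2},\; \frac{\tilde\sigma_\eta^2 \sigma_\gamma^2}{\tilde\sigma_\eta^2 T + \sigma_\gamma^2} \right),$$
$$\lambda \mid y, \varphi, \psi_{-\lambda} \sim N_{[-1,1]}\!\left( \frac{\tilde\sigma_\lambda^2 \sum_{t=1}^{T} (\gamma_t^{x_1} - \eta)\, \gamma_{t-1}^{x_1} + \tilde\mu_\lambda \sigma_\gamma^2}{\tilde\sigma_\lambda^2 \sum_{t=1}^{T} (\gamma_{t-1}^{x_1})^2 + \sigma_\gamma^2},\; \frac{\tilde\sigma_\lambda^2 \sigma_\gamma^2}{\tilde\sigma_\lambda^2 \sum_{t=1}^{T} (\gamma_{t-1}^{x_1})^2 + \sigma_\gamma^2} \right),$$
$$\sigma_\varepsilon^2 \mid y, \varphi, \psi_{-\sigma_\varepsilon^2} \sim \text{IG}\!\left( \tilde a_\varepsilon + \frac{pT}{2},\; \tilde b_\varepsilon + \frac{1}{2} \sum_{x=x_1}^{x_p} \sum_{t=1}^{T} \big( y_{x,t} - (\alpha_x + \beta_x \kappa_t + \gamma_t^{x}) \big)^2 \right),$$
$$\sigma_\kappa^2 \mid y, \varphi, \psi_{-\sigma_\kappa^2} \sim \text{IG}\!\left( \tilde a_\kappa + \frac{T}{2},\; \tilde b_\kappa + \frac{1}{2} \sum_{t=1}^{T} \big( \kappa_t - (\kappa_{t-1} + \theta) \big)^2 \right),$$
$$\sigma_\gamma^2 \mid y, \varphi, \psi_{-\sigma_\gamma^2} \sim \text{IG}\!\left( \tilde a_\gamma + \frac{T}{2},\; \tilde b_\gamma + \frac{1}{2} \sum_{t=1}^{T} \big( \gamma_t^{x_1} - \lambda \gamma_{t-1}^{x_1} - \eta \big)^2 \right).$$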
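To make the conjugate updates concrete, the sketch below computes the full-conditional parameters for $\alpha_x$ and $\sigma_\varepsilon^2$ and draws from an inverse-gamma distribution. The function and argument names (`alpha_posterior`, `sigma2_eps_posterior`, `draw_inverse_gamma`, `mu_prior`, `s2_prior`, `a_prior`, `b_prior`) are hypothetical, introduced only for illustration; the formulas follow the Normal and inverse-gamma posteriors stated in this subsection.

```python
import numpy as np

def alpha_posterior(y_x, beta_x, kappa, gamma_x, sigma2_eps, mu_prior, s2_prior):
    """Mean and variance of the Gaussian full conditional of alpha_x.

    y_x, kappa, gamma_x: (T,) arrays for a single age group x.
    """
    T = y_x.shape[0]
    resid_sum = np.sum(y_x - beta_x * kappa - gamma_x)
    denom = s2_prior * T + sigma2_eps
    mean = (s2_prior * resid_sum + mu_prior * sigma2_eps) / denom
    var = s2_prior * sigma2_eps / denom
    return mean, var

def sigma2_eps_posterior(resid, a_prior, b_prior):
    """Shape and scale of the inverse-gamma full conditional of sigma_eps^2.

    resid: (p, T) array of y_{x,t} - (alpha_x + beta_x kappa_t + gamma_t^x).
    """
    return a_prior + resid.size / 2.0, b_prior + 0.5 * np.sum(resid ** 2)

def draw_inverse_gamma(shape, scale, rng):
    # If X ~ Gamma(shape, rate=scale) then 1/X ~ IG(shape, scale);
    # NumPy parameterises the gamma by scale = 1/rate.
    return 1.0 / rng.gamma(shape, 1.0 / scale)
```

A draw of $\alpha_x$ is then `rng.normal(mean, np.sqrt(var))`; the remaining Gaussian full conditionals have the same structure with different residuals and regressors.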

Appendix A.4. Posteriors for Static Parameters in the Extended Models

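In order to develop the sampling algorithm for models incorporating the population information of European countries, we add to Equation (A6) the conjugate prior assumptions for the static parameters of Equation (7):
$$[\Omega]_{i,j} \sim N(\tilde\mu_\Omega, \tilde\sigma_\Omega^2), \quad \Psi_j \sim N(\tilde\mu_\Psi, \tilde\sigma_\Psi^2), \quad \sigma_\varrho^2 \sim \text{IG}(\tilde a_\varrho, \tilde b_\varrho).$$
The posteriors of the static parameters from Equation (7) are required for the Gibbs sampler regardless of which case of the extended model is considered. We use the lowercase letters $i, j$ to refer to ages $x_i, x_j \in \{x_1, \dots, x_p\}$, for $i, j \in \{1, \dots, p\}$, and the letters $m, l \in \{1, \dots, k\}$ to refer to the components of the matrix $F_t$. An element labelled $im$ corresponds to $[F_t]_{i,m}$ or to a latent variable $\varrho_t^{im}$. Then
$$\Psi_{im} \mid y, \tilde\varphi, \psi_{-\Psi_{im}} \sim N\!\left( \frac{\tilde\sigma_\Psi^2 \sum_{t=1}^{T} \big( \varrho_t^{im} - \varrho_{t-1}^{\top} [\Omega]_{im,\cdot} \big) + \tilde\mu_\Psi \sigma_\varrho^2}{\tilde\sigma_\Psi^2 T + \sigma_\varrho^2},\; \frac{\tilde\sigma_\Psi^2 \sigma_\varrho^2}{\tilde\sigma_\Psi^2 T + \sigma_\varrho^2} \right),$$
$$[\Omega]_{im,jl} \mid y, \tilde\varphi, \psi_{-[\Omega]_{im,jl}} \sim N\!\left( \frac{\tilde\sigma_\Omega^2 \sum_{t=1}^{T} \Big( \varrho_t^{im} - \Psi_{im} - \sum_{x_h \neq jl} \varrho_{t-1}^{x_h} [\Omega]_{im,x_h} \Big) \varrho_{t-1}^{jl} + \tilde\mu_\Omega \sigma_\varrho^2}{\tilde\sigma_\Omega^2 \sum_{t=1}^{T} (\varrho_{t-1}^{jl})^2 + \sigma_\varrho^2},\; \frac{\tilde\sigma_\Omega^2 \sigma_\varrho^2}{\tilde\sigma_\Omega^2 \sum_{t=1}^{T} (\varrho_{t-1}^{jl})^2 + \sigma_\varrho^2} \right),$$
$$\sigma_\varrho^2 \mid y, \tilde\varphi, \psi_{-\sigma_\varrho^2} \sim \text{IG}\!\left( \tilde a_\varrho + \frac{pkT}{2},\; \tilde b_\varrho + \frac{1}{2} \sum_{t=1}^{T} \sum_{i=1}^{p} \sum_{m=1}^{k} \big( \varrho_t^{im} - \Psi_{im} - \varrho_{t-1}^{\top} [\Omega]_{im,\cdot} \big)^2 \right),$$
where $\tilde\varphi$ is the vector of latent variables from Equation (8b) and $\psi$ is the vector of static parameters extended with those of Equation (7). Depending on the case of the extended model, the following posteriors replace those from Appendix A.3: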
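Case 1
Global factors $F_t$ in the observation equation
$$\alpha_x \mid y, \tilde\varphi, \psi_{-\alpha_x} \sim N\!\left( \frac{\tilde\sigma_\alpha^2 \sum_{t=1}^{T} \big( y_{x,t} - [\tilde B_t]_{x,\cdot}^{\top} \tilde\varphi_t \big) + \tilde\mu_\alpha \sigma_\varepsilon^2}{\tilde\sigma_\alpha^2 T + \sigma_\varepsilon^2},\; \frac{\tilde\sigma_\alpha^2 \sigma_\varepsilon^2}{\tilde\sigma_\alpha^2 T + \sigma_\varepsilon^2} \right),$$
$$\beta_x \mid y, \tilde\varphi, \psi_{-\beta_x} \sim N\!\left( \frac{\tilde\sigma_\beta^2 \sum_{t=1}^{T} \big( y_{x,t} - (\alpha_x + \gamma_t^{x} + [\tilde F_t]_{x,\cdot}^{\top} \varrho_t) \big) \kappa_t + \tilde\mu_\beta \sigma_\varepsilon^2}{\tilde\sigma_\beta^2 \sum_{t=1}^{T} \kappa_t^2 + \sigma_\varepsilon^2},\; \frac{\tilde\sigma_\beta^2 \sigma_\varepsilon^2}{\tilde\sigma_\beta^2 \sum_{t=1}^{T} \kappa_t^2 + \sigma_\varepsilon^2} \right),$$
$$\sigma_\varepsilon^2 \mid y, \tilde\varphi, \psi_{-\sigma_\varepsilon^2} \sim \text{IG}\!\left( \tilde a_\varepsilon + \frac{pT}{2},\; \tilde b_\varepsilon + \frac{1}{2} \sum_{x=x_1}^{x_p} \sum_{t=1}^{T} \big( y_{x,t} - (\alpha_x + [\tilde B_t]_{x,\cdot}^{\top} \tilde\varphi_t) \big)^2 \right).$$
Case 2
Global factors $F_t$ in the state equation of $\kappa_t$
$$\theta \mid y, \tilde\varphi, \psi_{-\theta} \sim N\!\left( \frac{\tilde\sigma_\theta^2 \sum_{t=1}^{T} \big( \kappa_t - \kappa_{t-1} - \tilde f_t^{\top} \varrho_t \big) + \tilde\mu_\theta \sigma_\kappa^2}{\tilde\sigma_\theta^2 T + \sigma_\kappa^2},\; \frac{\tilde\sigma_\theta^2 \sigma_\kappa^2}{\tilde\sigma_\theta^2 T + \sigma_\kappa^2} \right),$$
$$\sigma_\kappa^2 \mid y, \tilde\varphi, \psi_{-\sigma_\kappa^2} \sim \text{IG}\!\left( \tilde a_\kappa + \frac{T}{2},\; \tilde b_\kappa + \frac{1}{2} \sum_{t=1}^{T} \big( \kappa_t - (\kappa_{t-1} + \theta + \tilde f_t^{\top} \varrho_t) \big)^2 \right).$$
Case 3
Global factors $F_t$ in the state equation of $\gamma_t$
$$\eta \mid y, \tilde\varphi, \psi_{-\eta} \sim N\!\left( \frac{\tilde\sigma_\eta^2 \sum_{t=1}^{T} \big( \gamma_t^{x_1} - \lambda \gamma_{t-1}^{x_1} - \varrho_t^{\top} [\tilde F_t]_{x_1,\cdot} \big) + \tilde\mu_\eta \sigma_\gamma^2}{\tilde\sigma_\eta^2 T + \sigma_\gamma^2},\; \frac{\tilde\sigma_\eta^2 \sigma_\gamma^2}{\tilde\sigma_\eta^2 T + \sigma_\gamma^2} \right),$$
$$\lambda \mid y, \tilde\varphi, \psi_{-\lambda} \sim N_{[-1,1]}\!\left( \frac{\tilde\sigma_\lambda^2 \sum_{t=1}^{T} \big( \gamma_t^{x_1} - \eta - \varrho_t^{\top} [\tilde F_t]_{x_1,\cdot} \big) \gamma_{t-1}^{x_1} + \tilde\mu_\lambda \sigma_\gamma^2}{\tilde\sigma_\lambda^2 \sum_{t=1}^{T} (\gamma_{t-1}^{x_1})^2 + \sigma_\gamma^2},\; \frac{\tilde\sigma_\lambda^2 \sigma_\gamma^2}{\tilde\sigma_\lambda^2 \sum_{t=1}^{T} (\gamma_{t-1}^{x_1})^2 + \sigma_\gamma^2} \right),$$
$$\sigma_\gamma^2 \mid y, \tilde\varphi, \psi_{-\sigma_\gamma^2} \sim \text{IG}\!\left( \tilde a_\gamma + \frac{T}{2},\; \tilde b_\gamma + \frac{1}{2} \sum_{t=1}^{T} \big( \gamma_t^{x_1} - \lambda \gamma_{t-1}^{x_1} - \eta - \varrho_t^{\top} [\tilde F_t]_{x_1,\cdot} \big)^2 \right).$$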

Appendix A.5. Application of the Constraints

The constraints are applied for every iteration of the sampler. Let the current iteration be denoted by i, then the following procedures are performed:
  • The constraint on the vector $\beta^{(i)}$ is applied after sampling the unconstrained vector of static parameters $\beta^{(i)}$: the vector is mapped into a vector of transformed parameters $\tilde\beta^{(i)}$ by the rescaling $\tilde\beta^{(i)} = \beta^{(i)} / \sum_x \beta_x^{(i)}$. We then replace $\beta^{(i)}$ with $\tilde\beta^{(i)}$ and proceed to the next steps of the sampler.
  • The constraints on the latent processes $\kappa_t^{(i)}$ and $\gamma_t^{x\,(i)}$ are applied once the forward-backward step is complete. The unconstrained filtered estimates of the processes are transformed to $\tilde\kappa_t^{(i)} = \kappa_t^{(i)} - \bar\kappa^{(i)}$ and $\tilde\gamma_t^{x\,(i)} = \gamma_t^{x\,(i)} - \bar\gamma^{(i)}$, with $\bar\kappa^{(i)} = \frac{1}{N} \sum_{k=1}^{N} \kappa_{t_k}^{(i)}$ and $\bar\gamma^{(i)} = \frac{1}{N+p-1} \sum_{c=t_1-x_p}^{t_N-x_1} \gamma_c^{(i)}$, where $N$ here denotes the number of time points $t_1, \dots, t_N$. We then replace $\kappa_t^{(i)}$ with $\tilde\kappa_t^{(i)}$ and $\gamma_t^{(i)}$ with $\tilde\gamma_t^{(i)}$ and proceed to the next steps of the sampler.
If one fits the full cohort model, the constraint on the vector $\beta^{\gamma}$ is applied in the same fashion as for the vector $\beta$. For the simplified cohort model, this vector of parameters is set to the vector of ones and is not sampled.
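The constraint step described above amounts to a rescaling and two centrings, which can be sketched as follows. The function name `apply_constraints` and the representation of the cohort effect as a single vector of the $N + p - 1$ distinct cohort values are assumptions made for the illustration.

```python
import numpy as np

def apply_constraints(beta, kappa, gamma_cohorts):
    """Map one sampler draw to the constrained parameter space.

    beta:          (p,) sampled age loadings
    kappa:         (N,) sampled period effect over the N time points
    gamma_cohorts: (N + p - 1,) sampled values for the distinct cohorts
    """
    beta_tilde = beta / beta.sum()                       # sum of beta equals 1
    kappa_tilde = kappa - kappa.mean()                   # kappa sums to 0
    gamma_tilde = gamma_cohorts - gamma_cohorts.mean()   # gamma sums to 0
    return beta_tilde, kappa_tilde, gamma_tilde
```

The sketch mirrors the description literally; it does not compensate other parameters for the rescaling and centring, which a full implementation may need to consider.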

Appendix B. Description of Stochastic Mortality Models Utilizing Factor Extraction from European Demographic Data

Recalling the notation of Equation (8), all models which utilize features extracted from demographic data have the following state-space representation:
$$y_t = \alpha + \tilde B_t \tilde\varphi_t + \varepsilon_t, \qquad \varepsilon_t \overset{iid}{\sim} N(0, \sigma_\varepsilon^2 I_{21}),$$
$$\tilde\varphi_t = \tilde\Lambda \tilde\varphi_{t-1} + \tilde\Theta + \tilde\omega_t, \qquad \tilde\omega_t \overset{iid}{\sim} N(0, \tilde\Upsilon).$$
As we examine only Case 1 defined in Section 3, the corresponding matrices of the observation and state equations are
$$\underset{21 \times (22+m)}{\tilde B_t} = \begin{bmatrix} B_{21 \times 22} & \tilde F_t \end{bmatrix}, \qquad \underset{(22+m) \times (22+m)}{\tilde\Lambda} = \begin{bmatrix} \Lambda_{22 \times 22} & 0_{22 \times m} \\ 0_{m \times 22} & \Omega_{m \times m} \end{bmatrix},$$
where $m$ is the dimensionality of the latent process vector $\varrho_t$ which corresponds to the factor matrix $\tilde F_t$. The structure of this matrix depends on the model, as discussed in the next two subsections.
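The block structure of the extended matrices can be assembled directly, as in the following sketch. The function name `build_extended_matrices` is hypothetical; the dimensions in the comments assume 21 age groups, a 22-dimensional LC cohort state and an $m$-dimensional factor state, as in this appendix.

```python
import numpy as np

def build_extended_matrices(B, F_t, Lam, Omega):
    """Assemble the extended observation and transition matrices for Case 1.

    B: (21, 22) loadings of the LC cohort state; F_t: (21, m) factor matrix;
    Lam: (22, 22) LC cohort transition; Omega: (m, m) factor-state transition.
    """
    d = Lam.shape[0]
    m = Omega.shape[0]
    # Observation matrix: [B  F_t], shape 21 x (22 + m)
    B_ext = np.hstack([B, F_t])
    # Block-diagonal transition matrix, shape (22 + m) x (22 + m)
    Lam_ext = np.block([
        [Lam, np.zeros((d, m))],
        [np.zeros((m, d)), Omega],
    ])
    return B_ext, Lam_ext
```

With the extended matrices in hand, the same filtering and sampling recursions used for the cohort model apply unchanged to the extended state vector.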

Appendix B.1. DFM-PC-B Model

The model is constructed as follows:
Step 1:
Take the first eigenvector of the robustly standardized Birth counts, which is a country-specific vector,
$$[F]_{\cdot,1} = \begin{bmatrix} f_{1,1}^{AUT} \\ \vdots \\ f_{21,1}^{UKR} \end{bmatrix};$$
Step 2:
Take the mean across the countries (the components of the first eigenvector), which is a scalar $\hat f$, and use one latent process per age group to model additional supplementary information per age group;
Step 3:
Incorporate $\hat f$ as the diagonal elements of the matrix $\tilde F$,
$$\tilde F = \begin{bmatrix} \hat f & 0 & \cdots & 0 \\ 0 & \hat f & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat f \end{bmatrix}_{21 \times 21}.$$
The corresponding $\varrho_t$ is an age-group-specific vector, that is
$$\varrho_t = \left( \varrho_t^{0}, \dots, \varrho_t^{95} \right)_{1 \times 21}.$$
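The three steps above reduce to averaging the eigenvector and placing the result on a diagonal, as in this short sketch. The function name `births_factor_matrix` and the input `first_eigvec` are hypothetical; the eigenvector itself is assumed to have been extracted beforehand by the robust PCA procedure described in the paper.

```python
import numpy as np

def births_factor_matrix(first_eigvec, n_age_groups=21):
    """Build the diagonal factor matrix used in the DFM-PC-B construction.

    first_eigvec: 1-D array with the country-specific components of the
                  first eigenvector of the standardized Birth counts.
    """
    f_hat = float(np.mean(first_eigvec))   # Step 2: mean across countries
    return f_hat * np.eye(n_age_groups)    # Step 3: f_hat on the diagonal
```

Because the matrix is a scalar multiple of the identity, its contribution to the observation equation for age group $x$ is simply $\hat f$ times the corresponding component of $\varrho_t$.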

Appendix B.2. The Models of the Class DFM-PC-D and DFM-PC-Mx

The models are constructed as follows:
Step 1:
Take the first eigenvector of the robustly standardized corresponding data set, which is an age- and country-specific matrix,
$$F = \begin{bmatrix} f_{1,1}^{AUT} & \cdots & f_{1,28}^{UKR} \\ \vdots & \ddots & \vdots \\ f_{21,1}^{AUT} & \cdots & f_{21,28}^{UKR} \end{bmatrix}_{21 \times 28};$$
Step 2:
Notice that the matrix $\tilde F$ is equal to $F$;
Step 3:
Use one latent process per country to model the impact of the country-specific vector.
The corresponding $\varrho_t$ is a country-specific vector, that is
$$\varrho_t = \left( \varrho_t^{AUT}, \dots, \varrho_t^{UKR} \right)_{1 \times 28}.$$

References

  1. Basilevsky, Alexander T. 1994. Statistical Factor Analysis and Related Methods. Hoboken: John Wiley & Sons, Inc.
  2. Cairns, Andrew J. G., David Blake, Kevin Dowd, Guy D. Coughlan, David Epstein, Alen Ong, and Igor Balevich. 2009. A quantitative comparison of stochastic mortality models using data from England and Wales and the United States. North American Actuarial Journal 13: 1–35.
  3. Carter, Chris K., and Robert Kohn. 1994. On Gibbs sampling for state-space models. Biometrika 81: 541–53.
  4. Celeux, G., F. Forbes, C. P. Robert, and D. M. Titterington. 2006. Deviance information criteria for missing data models. Bayesian Analysis 1: 651–73.
  5. Davies, P. Laurie. 1987. Asymptotic Behaviour of S-Estimates of Multivariate Location Parameters and Dispersion Matrices. The Annals of Statistics 15: 1269–92.
  6. Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39: 1–38.
  7. Diebold, Francis X., and Canlin Li. 2006. Forecasting the term structure of government bond yields. Journal of Econometrics 130: 337–64.
  8. Erbas, Bircan, Muhammed Akram, Dorota M. Gertig, Dallas English, John L. Hopper, Anne M. Kavanagh, and Rob Hyndman. 2010. Using functional data analysis models to estimate future time trends in age-specific breast cancer mortality for the United States and England-Wales. Journal of Epidemiology 20: 159–65.
  9. Frahm, Gabriel, and Uwe Jaekel. 2010. A generalization of Tyler’s M-estimators to the case of incomplete data. Computational Statistics & Data Analysis 54: 374–93.
  10. Friedman, Jerome H., and John W. Tukey. 1974. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers C-23: 881–90.
  11. Fung, Man Chung, Gareth W. Peters, and Pavel V. Shevchenko. 2016. A unified approach to mortality modelling using state-space framework: Characterisation, identification, estimation and forecasting. Annals of Actuarial Science, 1–47.
  12. Fung, Man Chung, Gareth William Peters, and Pavel V. Shevchenko. 2017. Cohort Effects in Mortality Modelling: A Bayesian State-Space Analysis. Available online: https://ssrn.com/abstract=2907868 (accessed on 31 January 2017).
  13. Gaille, Séverine Arnold, and Michael Sherris. 2015. Causes-of-Death Mortality: What Do We Know on Their Dependence? North American Actuarial Journal 19: 116–28.
  14. Gavrilov, Leonid A., and Natalia S. Gavrilova. 2015. Predictors of Exceptional Longevity: Effects of Early-Life and Midlife Conditions, and Familial Longevity. North American Actuarial Journal 19: 174–86.
  15. Girosi, Federico, and Gary King. 2008. Demographic Forecasting. Princeton: Princeton University Press.
  16. Haberman, Steven, and Arthur Renshaw. 2011. A comparative study of parametric mortality projection models. Insurance: Mathematics and Economics 48: 35–55.
  17. Hanewald, Katja. 2011. Explaining Mortality Dynamics: The Role of Macroeconomic Fluctuations and Cause of Death Trends. North American Actuarial Journal, Series B, 290–314.
  18. Huber, Peter J. 1964. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics 35: 73–101.
  19. Huber, Peter J., and Elvezio M. Ronchetti. 2009. Robust Statistics. Wiley Series in Probability and Statistics; Hoboken: John Wiley & Sons, Inc., p. 380.
  20. Hunt, Andrew, and Andres M. Villegas. 2015. Robustness and convergence in the Lee-Carter model with cohort effects. Insurance: Mathematics and Economics 64: 186–202.
  21. Hyndman, Rob J., and Farah Yasmeen. 2012. Common functional principal component models for mortality forecasting. In Contributions in Infinite-Dimensional Statistics and Related Topics. chp. 29, pp. 161–66.
  22. Jamshidian, Mortaza. 1997. An EM Algorithm for ML Factor Analysis with Missing Data. In Lecture Notes in Statistics. New York: Springer, pp. 247–58.
  23. Jolliffe, I. T. 2002. Principal Component Analysis. New York: Springer.
  24. Kogure, Atsuyuki, and Yoshiyuki Kurachi. 2010. A Bayesian approach to pricing longevity risk based on risk-neutral predictive distributions. Insurance: Mathematics and Economics 46: 162–72.
  25. Lee, Ronald D., and Lawrence R. Carter. 1992. Modeling and forecasting US mortality. Journal of the American Statistical Association 87: 659–75.
  26. Little, Roderick J. A., and Donald B. Rubin. 2002. Statistical Analysis with Missing Data, 2nd ed. Hoboken: John Wiley & Sons, Inc.
  27. Lopuhaa, Hendrik P. 1989. On the Relation between S-Estimators and M-Estimators of Multivariate Location and Covariance. The Annals of Statistics 17: 1662–83.
  28. Maronna, Ricardo Antonio. 1976. Robust M-Estimators of Multivariate Location and Scatter. The Annals of Statistics 4: 51–67.
  29. Murray, Christopher J. L., and Alan D. Lopez. 1997. Alternative projections of mortality and disability by cause 1990–2020: Global burden of disease study. The Lancet, 1498–1504.
  30. Niu, Geng, and Bertrand Melenberg. 2014. Trends in Mortality Decrease and Economic Growth. Demography 51: 1755–73.
  31. Pedroza, Claudia. 2006. A Bayesian forecasting model: predicting US male mortality. Biostatistics 7: 530–50.
  32. Renshaw, Arthur E., and Steven Haberman. 2006. A cohort-based extension to the Lee-Carter model for mortality reduction factors. Insurance: Mathematics and Economics 38: 556–70.
  33. Rousseeuw, Peter, and Victor Yohai. 1984. Robust Regression by Means of S-Estimators. In Robust and Nonlinear Time Series Analysis. New York: Springer, pp. 256–72.
  34. Roweis, Sam T. 1998. EM Algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems. Cambridge: MIT Press, pp. 626–32.
  35. Rubin, Donald B., and Dorothy T. Thayer. 1982. EM algorithms for ML factor analysis. Psychometrika 47: 69–76.
  36. Shang, Han Lin, and Rob J. Hyndman. 2016. Grouped functional time series forecasting: An application to age-specific mortality rates. Journal of Computational and Graphical Statistics 26: 330–43.
  37. Spiegelhalter, David J., Nicola G. Best, and Bradley P. Carlin. 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 64: 583–639.
  38. Sun, Ying, Prabhu Babu, and Daniel P. Palomar. 2016. Robust Estimation of Structured Covariance Matrix for Heavy-Tailed Elliptical Distributions. IEEE Transactions on Signal Processing 64: 3576–90.
  39. Tipping, Michael E., and Christopher M. Bishop. 1999. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 61: 611–22.
  40. Tyler, David E. 1987. A Distribution-Free M-Estimator of Multivariate Scatter. The Annals of Statistics 15: 234–51.
  41. Tyler, David E. 1987. Statistical Analysis for the Angular Central Gaussian Distribution on the Sphere. Biometrika 74: 579.
  42. Willets, R. C. 2004. The cohort effect: insights and explanations. British Actuarial Journal 10: 833–77.
  43. Wilmoth, J. R., K. Andreev, and D. Jdanov. 2007. Methods Protocol for the Human Mortality Database. Technical Report. Available online: http://www.mortality.org/Public/Docs/MethodsProtocol.pdf (accessed on 27 July 2017).
1. For simplicity, we denote $y = y_{1:T}$, $\varphi = \varphi_{0:T}$ and $\psi_{-h} = (\psi_1, \dots, \psi_{h-1}, \psi_{h+1}, \dots, \psi_{2p-1})$.
Figure 1. The percentage of missing entries (y axis) per observation vector over time (x axis) for the Births counts (left plot) and Deaths counts (right plot) for the female (blue line) and male (black line) populations. Red vertical lines correspond to the starting points in time of samples when the maximum of missing entries is equal to (from the left side on the corresponding plots) 75%, 50% and 25%.
Figure 2. The indicator of a missing value (black colour) per country (y axis) over time (x axis) for the Births counts for the female population. Red vertical lines correspond to the starting points in time of samples when the maximum of missing entries is equal to (from the left side of the plot) 75%, 50% and 25%.
Figure 3. Percentages of missing values (denoted by diffrent colours) per observation for Number of Deaths for Females per age groups (y axis) over time (x axis). The titles of the subplots indicate the case of missing values (50%, 75%). The percentage for a given country and given age group is computed dividing number of missing values by number of countries. Red vertical lines correspond to the starting points in time when the cases 50% and 25% start (from the left to right side on corresponding plots).
Figure 4. Percentages of missing values (denoted by different colours) per observation for Number of Deaths for Males per age group (y axis) over time (x axis). The titles of the subplots indicate the case of missing values (50%, 75%). The percentage for a given country and given age group is computed by dividing the number of missing values by the number of countries. Red vertical lines mark the points in time at which the 50% and 25% cases start (from left to right on the corresponding plots).
Figure 5. Percentages of missing values (denoted by different colours) for Number of Deaths for Females per country (x axis) and age group (y axis). The titles of the subplots indicate the case of missing values (25%, 50%, 75%). The percentage of missing values for a given country and age group is calculated by dividing the number of missing values by the length of the subsample, which differs between cases.
Figure 6. Percentages of missing values (denoted by different colours) for Number of Deaths for Males per country (x axis) and age group (y axis). The titles of the subplots indicate the case of missing values (25%, 50%, 75%). The percentage of missing values for a given country and age group is calculated by dividing the number of missing values by the length of the subsample, which differs between cases.
Figure 7. The Mahalanobis distances obtained using Probabilistic Principal Component Analysis (PPCA) for Females (a) and Males (b) Births over time (x axis). Different line colours correspond to different maximal percentages of missing values in a single observation (light blue (75%), dark brown (50%), dark blue (25%), light brown (0%)). Every subfigure is divided into two subplots corresponding to the robust estimate of the standard deviations (upper plot) and the sample estimate (bottom plot).
Figure 8. The eigenvalues obtained using PPCA for the Females (a) and Males (b) populations of Number of Births for different percentages of maximal missing entries in rows (x axis). Line colours correspond to different eigenvalues: the first (light brown), second (dark blue) and third (dark brown) highest. Every subfigure is divided into two subplots corresponding to the robust estimate of the standard deviations (upper plot) and the sample estimate (bottom plot).
Figure 9. The eigenvectors (y axis) over the joint distribution of countries (x axis) obtained using PPCA for Females in Births. Every row of subfigures corresponds to a different eigenvector. Every column corresponds to a different case of missing values (0%, 25%, 50% and 75%). The blue line corresponds to robust standardisation, whereas the red line corresponds to non-robust standardisation of the data.
Figure 10. The scores (y axis) over time (x axis) obtained using PPCA for Females in Births. Line colours correspond to the scores calculated on subsamples for different cases of missing values (0%, 25%, 50% and 75%, refer to legend). The plots in the first row correspond to the results using non-robust standardisation of the input data, whereas the second row corresponds to robust standardisation. The plots of the first, second and third scores are ordered by columns.
Figure 11. The Mahalanobis distances obtained using PPCA for Females (a) and Males (b) of Life Expectancy at Birth over time (x axis). Different line colours correspond to different percentages of maximal missing entries in rows (light blue (75%), dark brown (50%), dark blue (25%), light brown (0%)). Every subfigure is divided into two subplots corresponding to the robust estimate of the standard deviations (upper plot) and the sample estimate (bottom plot).
Figure 12. The eigenvalues obtained using PPCA for Females (a) and Males (b) of Life Expectancy at Birth data for different percentages of maximal missing entries in rows (x axis). Line colours correspond to different eigenvalues: the first (light brown), second (dark blue) and third (dark brown) highest. Every subfigure is divided into two subplots corresponding to the robust estimate of the standard deviations (upper plot) and the sample estimate (bottom plot).
Figure 13. The eigenvectors (y axis) over the joint distribution of countries (x axis) obtained using PPCA for the female population of Life Expectancy at Birth. Every row of subfigures corresponds to a different eigenvector. Every column corresponds to a different level of maximum missing values per observation (0%, 25%, 50% and 75%). The blue line corresponds to robust standardisation, whereas the red line corresponds to non-robust standardisation of the data.
Figure 14. The eigenvectors (y axis) over the joint distribution of countries (x axis) obtained using PPCA for the male population of Life Expectancy at Birth. Every row of subfigures corresponds to a different eigenvector. Every column corresponds to a different level of maximum missing values per observation (0%, 25%, 50% and 75%). The blue line corresponds to robust standardisation, whereas the red line corresponds to non-robust standardisation of the data.
Figure 15. The scores (y axis) over time (x axis) obtained using PPCA for the female population of Life Expectancy at Birth. Line colours correspond to the scores calculated on subsamples with different levels of maximum missing values per observation (0%, 25%, 50% and 75%, refer to legend). The plots in the first row correspond to the results using non-robust standardisation of the input data, whereas the second row corresponds to robust standardisation. The plots of scores are ordered by columns.
Figure 16. The scores (y axis) over time (x axis) obtained using PPCA for Males of Life Expectancy at Birth. Line colours correspond to the scores calculated on subsamples with different levels of maximum missing values per observation (0%, 25%, 50% and 75%, refer to legend). The plots in the first row correspond to the results using non-robust standardisation of the input data, whereas the second row corresponds to robust standardisation. The plots of scores are ordered by columns.
Figure 17. The Mahalanobis distances obtained using PPCA for Females (a) and Males (b) of Number of Deaths over time (x axis). Different line colours correspond to different percentages of maximal missing entries in rows (light blue (75%), dark brown (50%), dark blue (25%), light brown (0%)). Every subfigure is divided into two subplots corresponding to the robust estimate of the standard deviations (upper plot) and the sample estimate (bottom plot).
Figure 18. The eigenvalues of Deaths counts obtained using PPCA for Females (a) and Males (b) over different cases of missing entries (x axis). Line colours correspond to different eigenvalues: the first (light brown), second (dark blue) and third (dark brown) highest. Every subfigure is divided into two subplots corresponding to the robust estimate of the standard deviations (upper plot) and the sample estimate (bottom plot).
Figure 19. The eigenvectors of Death counts (y axis) over age groups (y axis) and countries (x axis) obtained using PPCA for Females. Every row of subfigures corresponds to a different eigenvector. Every column corresponds to a different level of maximum missing values per observation (0%, 25%, 50% and 75%). The blue line corresponds to robust standardisation, whereas the red line corresponds to non-robust standardisation of the data.
Figure 20. The eigenvectors of Death counts (y axis) over age groups (y axis) and countries (x axis) obtained using PPCA for Males. Every row of subfigures corresponds to a different eigenvector. Every column corresponds to a different level of maximum missing values per observation (0%, 25%, 50% and 75%). The blue line corresponds to robust standardisation, whereas the red line corresponds to non-robust standardisation of the data.
Figure 21. The scores (y axis) over time (x axis) obtained using PPCA for the female population of Number of Deaths. Line colours correspond to the scores calculated on subsamples with different levels of maximum missing values per observation (0%, 25%, 50% and 75%, refer to legend). The plots in the first row correspond to the results using non-robust standardisation of the input data, whereas the second row corresponds to robust standardisation. The plots of scores are ordered by columns.
Figure 22. The scores (y axis) over time (x axis) obtained using PPCA for the male population of Number of Deaths. Line colours correspond to the scores calculated on subsamples with different levels of maximum missing values per observation (0%, 25%, 50% and 75%, refer to legend). The plots in the first row correspond to the results using non-robust standardisation of the input data, whereas the second row corresponds to robust standardisation. The plots of scores are ordered by columns.
Figure 23. The Mahalanobis distances obtained using PPCA for the female (a) and male (b) populations of Death Rates over time (x axis). Different line colours correspond to different percentages of maximal missing entries in rows (light blue (75%), dark brown (50%), dark blue (25%), light brown (0%)). Every subfigure is divided into two subplots corresponding to the robust estimate of the standard deviations (upper plot) and the sample estimate (bottom plot).
Figure 24. The eigenvalues obtained using PPCA for the female (a) and male (b) populations of Death Rates for different percentages of maximal missing entries in rows (x axis). Line colours correspond to different eigenvalues: the first (light brown), second (dark blue) and third (dark brown) highest. Every subfigure is divided into two subplots corresponding to the robust estimate of the standard deviations (upper plot) and the sample estimate (bottom plot).
Figure 25. The scores (y axis) over time (x axis) obtained using PPCA for the female population of Death Rates. Line colours correspond to the scores calculated on subsamples with different levels of maximum missing values per observation (0%, 25%, 50% and 75%, refer to legend). The plots in the first row correspond to the results using non-robust standardisation of the input data, whereas the second row corresponds to robust standardisation. The plots of scores are ordered by columns.
Figure 26. The scores (y axis) over time (x axis) obtained using PPCA for the male population of Death Rates. Line colours correspond to the scores calculated on subsamples with different levels of maximum missing values per observation (0%, 25%, 50% and 75%, refer to legend). The plots in the first row correspond to the results using non-robust standardisation of the input data, whereas the second row corresponds to robust standardisation. The plots of scores are ordered by columns.
Figure 27. The eigenvectors (y axis) over the joint distribution of countries (x axis) obtained using PPCA for the female population of Death Rates. Every row of subfigures corresponds to a different eigenvector. Every column corresponds to a different level of maximum missing values per observation (0%, 25%, 50% and 75%). The blue line corresponds to robust standardisation, whereas the red line corresponds to non-robust standardisation of the data.
Figure 28. The eigenvectors (y axis) over the joint distribution of countries (x axis) obtained using PPCA for the male population of Death Rates. Every row of subfigures corresponds to a different eigenvector. Every column corresponds to a different level of maximum missing values per observation (0%, 25%, 50% and 75%). The blue line corresponds to robust standardisation, whereas the red line corresponds to non-robust standardisation of the data.
Figure 29. Bayesian posterior mean estimators with 95% posterior credible intervals for the estimation of the age-specific diagonal elements of the transition matrix $\Omega$ (x axis) under DFM-PC-B.
Figure 30. Bayesian posterior estimators with 95% posterior credible intervals for the estimation of α and β under different models (colours of lines) for British female mortality data (1922–2002).
Figure 31. Bayesian posterior mean estimators with 95% posterior credible intervals for the estimation of the diagonal elements of the transition matrix $\Omega$ (x axis) under the DFM-PC-D-r, DFM-PC-D-s, DFM-PC-Mx-r and DFM-PC-Mx-s models (colours of lines). The dashed blue line divides the set of countries into developed (on the left-hand side) and developing (on the right-hand side) European countries.
Figure 32. The Bayesian posterior mean estimates with 95% posterior credible intervals for $\kappa_t$ (upper panel) and the cohort effect state process $\gamma_t^{0}$ (lower panel) under different models (colours of lines) for British female log death rates during 1922–2002.
Figure 33. The Bayesian posterior mean estimates for $\varrho_t$ across age groups (y axis) over time (x axis) under the DFM-PC-B model for British female log death rates during 1922–2002.
Figure 34. The Bayesian posterior mean estimates for the cohort effect latent process vector $\gamma_t$ across age groups (y axis) over time (x axis) under different models for British female log death rates during 1922–2002.
Figure 35. The Bayesian posterior mean estimates for $\varrho_t$ across countries (y axis) over time (x axis) under the models from the classes DFM-PC-D and DFM-PC-Mx for British female log death rates during 1922–2002. The vertical blue line divides the countries into developed (on the left side) and developing (on the right side) European countries.
Figure 36. The Bayesian posterior mean estimates with 95% posterior credible intervals for $\varrho_t$ across countries (different panels) over time (x axis) under the models from the classes DFM-PC-D and DFM-PC-Mx (colours of lines) for British female log death rates during 1922–2002.
Figure 37. 10-year out-of-sample forecasts of log death rates (y axis) for different age groups (different subplots) under different models (colours of lines) with corresponding prediction intervals. Calibration period: 1922–2002.
Figure 38. Bayesian posterior estimators with 95% posterior credible intervals for the estimation of α and β under different models (colours of lines) for British female mortality data (1922–2002).
Figure 39. The Bayesian posterior mean estimates with 95% posterior credible intervals for $\kappa_t$ (upper panel) and the cohort effect state process $\gamma_t^{0}$ (lower panel) under different models (colours of lines) for British female log death rates during 1922–2002.
Figure 40. The Bayesian posterior mean estimates with 95% posterior credible intervals, averaged over time, for $\alpha + \tilde{F}_t \varrho_t$ under DFM-PC-D-r.
Figure 41. The Bayesian posterior mean estimates of $\alpha + \tilde{F}_t \varrho_t^{T}$ over time for DFM-PC-D-r. Colours of lines are related to the age groups (the elements of the vector $\alpha$).
Table 1. The availability of the demographic data per country (Human Mortality Database).

| Country | Life Expectancy ($E_0$) | No. Births | Death Rate ($m_x$) | No. Deaths |
|---|---|---|---|---|
| Austria | 1947–2014 | 1871–2014 | 1947–2014 | 1947–2014 |
| Belarus | 1959–2014 | 1959–2014 | 1959–2014 | 1959–2014 |
| Belgium | 1841–2015 | 1840–2015 | 1841–2015 | 1841–2015 |
| Czech Republic | 1950–2010 | 1947–2014 | 1950–2014 | 1950–2014 |
| Denmark | 1835–2014 | 1835–2014 | 1835–2014 | 1835–2014 |
| Estonia | 1959–2013 | 1959–2013 | 1959–2013 | 1959–2013 |
| Finland | 1878–2012 | 1865–2012 | 1878–2012 | 1878–2012 |
| France | 1816–2014 | 1806–2014 | 1816–2014 | 1816–2014 |
| East Germany | 1956–2013 | 1946–2013 | 1956–2013 | 1956–2013 |
| West Germany | 1956–2013 | 1946–2013 | 1956–2013 | 1956–2013 |
| Greece | 1981–2013 | 1981–2013 | 1981–2013 | 1981–2013 |
| Hungary | 1950–2014 | 1950–2014 | 1950–2014 | 1950–2014 |
| Iceland | 1838–2013 | 1838–2013 | 1838–2013 | 1838–2013 |
| Ireland | 1950–2014 | 1950–2014 | 1950–2014 | 1950–2014 |
| Italy | 1872–2012 | 1862–2012 | 1872–2012 | 1872–2012 |
| Latvia | 1959–2013 | 1959–2013 | 1959–2013 | 1959–2013 |
| Lithuania | 1959–2013 | 1959–2013 | 1959–2013 | 1959–2013 |
| Luxembourg | 1960–2014 | 1950–2014 | 1960–2014 | 1960–2014 |
| Netherlands | 1850–2012 | 1850–2012 | 1850–2012 | 1850–2012 |
| Norway | 1846–2014 | 1846–2014 | 1846–2014 | 1846–2014 |
| Poland | 1958–2014 | 1958–2014 | 1958–2014 | 1958–2014 |
| Portugal | 1940–2012 | 1886–2012 | 1940–2012 | 1940–2012 |
| Russia | 1959–2014 | 1959–2014 | 1959–2014 | 1959–2014 |
| Slovakia | 1950–2014 | 1950–2014 | 1950–2014 | 1950–2014 |
| Slovenia | 1983–2014 | 1983–2014 | 1983–2014 | 1983–2014 |
| Spain | 1908–2014 | 1908–2014 | 1908–2014 | 1908–2014 |
| Sweden | 1751–2014 | 1747–2014 | 1751–2014 | 1751–2014 |
| Switzerland | 1876–2014 | 1871–2014 | 1876–2014 | 1876–2014 |
| United Kingdom | 1922–2013 | 1922–2013 | 1922–2013 | 1922–2013 |
| Ukraine | 1959–2013 | 1946–2013 | 1959–2013 | 1959–2013 |
Table 2. Bayesian posterior mean estimators with 95% posterior credible intervals for the estimation of the static parameters $\lambda$, $\theta$, $\eta$, $\sigma_\epsilon^2$, $\sigma_\gamma^2$, $\sigma_\kappa^2$, $\sigma_\varrho^2$ of $\log m_{x,t}$.

| Model | $\lambda$ | $\theta$ | $\eta$ | $\sigma_\epsilon^2$ | $\sigma_\gamma^2$ | $\sigma_\kappa^2$ | $\sigma_\varrho^2$ |
|---|---|---|---|---|---|---|---|
| LCC | 0.998 (0.994; 1) | −0.154 (−0.331; 0.026) | −0.024 (−0.034; −0.014) | $6.4\times10^{-3}$ ($6\times10^{-3}$; $6.9\times10^{-3}$) | $2\times10^{-3}$ ($1.4\times10^{-3}$; $2.8\times10^{-3}$) | 0.663 (0.449; 0.96) | |
| DFM-PC-B | 0.991 (0.968; 1) | −0.332 (−0.53; −0.137) | −0.005 (−0.01; 0.002) | $8\times10^{-4}$ ($6\times10^{-4}$; $9\times10^{-4}$) | $5\times10^{-4}$ ($4\times10^{-4}$; $7\times10^{-4}$) | 0.753 (0.537; 1.055) | 0.049 (0.042; 0.057) |
| DFM-PC-D-r | 0.949 (0.913; 0.992) | −0.246 (−0.415; −0.101) | 0.011 (−0.002; 0.021) | $1\times10^{-3}$ ($9\times10^{-4}$; $1.1\times10^{-3}$) | $5\times10^{-4}$ ($4\times10^{-4}$; $8\times10^{-4}$) | 0.39 (0.227; 0.739) | 0.092 (0.074; 0.113) |
| DFM-PC-D-s | 0.998 (0.993; 1) | −0.093 (−0.221; 0.029) | −0.019 (−0.025; −0.013) | $1.3\times10^{-4}$ ($1.1\times10^{-4}$; $1.4\times10^{-4}$) | $8\times10^{-4}$ ($5\times10^{-4}$; $1.1\times10^{-3}$) | 0.324 (0.152; 0.616) | 0.144 (0.114; 0.178) |
| DFM-PC-Mx-r | 0.985 (0.959; 0.999) | −0.042 (−0.115; 0) | −0.013 (−0.02; −0.007) | $8\times10^{-4}$ ($7\times10^{-4}$; $1\times10^{-3}$) | $6\times10^{-4}$ ($4\times10^{-4}$; $8\times10^{-4}$) | 0.044 (0.002; 0.116) | 0.08 (0.066; 0.094) |
| DFM-PC-Mx-s | 0.999 (0.995; 1) | −0.024 (−0.111; 0.044) | −0.02 (−0.03; −0.01) | $1.2\times10^{-4}$ ($8\times10^{-4}$; $2.1\times10^{-4}$) | $1.7\times10^{-3}$ ($1.2\times10^{-3}$; $2.3\times10^{-3}$) | 0.036 (0.001; 0.137) | 0.834 (0.594; 0.994) |
Table 3. The MAP estimates of the static parameters $\lambda$, $\theta$, $\eta$, $\sigma_\epsilon^2$, $\sigma_\gamma^2$, $\sigma_\kappa^2$, $\sigma_\varrho^2$ of $\log m_{x,t}$.

| Model | $\lambda$ | $\theta$ | $\eta$ | $\sigma_\epsilon^2$ | $\sigma_\gamma^2$ | $\sigma_\kappa^2$ | $\sigma_\varrho^2$ |
|---|---|---|---|---|---|---|---|
| LCC | 0.999 | −0.155 | −0.024 | 0.0064 | 0.0019 | 0.6172 | |
| DFM-PC-B | 0.998 | −0.331 | −0.005 | $8.00\times10^{-4}$ | $5.00\times10^{-4}$ | 0.7279 | 0.0487 |
| DFM-PC-D-r | 0.948 | −0.253 | 0.012 | 0.001 | $5.00\times10^{-4}$ | 0.3234 | 0.0895 |
| DFM-PC-D-s | 0.999 | −0.095 | −0.02 | 0.0013 | $7.00\times10^{-4}$ | 0.3005 | 0.1416 |
| DFM-PC-Mx-r | 0.995 | −0.023 | −0.012 | $7.00\times10^{-4}$ | $6.00\times10^{-4}$ | 0.029 | 0.0818 |
| DFM-PC-Mx-s | 1 | −0.023 | −0.021 | $9.00\times10^{-4}$ | 0.0016 | 0.0088 | 0.8583 |
Table 4. Mean square error of the fit of the models to the data (MSE), deviance information criterion (DIC) and mean square errors of predictions using forecasting distributions given by MCMC samples (MSEP$_{MCMC}$) and Kalman Filter (MSEP$_{Kalman}$).

| Model | MSE | DIC | MSEP$_{MCMC}$ | MSEP$_{Kalman}$ |
|---|---|---|---|---|
| LCC | 0.0097 | −3627 | 0.1778 | 0.1774 |
| DFM-PC-B | 0.0072 | −6500 | 0.0057 | 0.0062 |
| DFM-PC-D-r | 0.0182 | −6380 | 0.0177 | 0.0251 |
| DFM-PC-D-s | 0.0065 | −5996 | 0.0185 | 0.0156 |
| DFM-PC-Mx-r | 0.0081 | −8225 | 0.0111 | 0.0129 |
| DFM-PC-Mx-s | 0.0174 | −3951 | 0.0692 | 0.0285 |
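As a rough illustration of how the fit measures reported in Table 4 are typically computed, the sketch below evaluates a mean square error over an age-by-year grid of log death rates and a DIC from posterior deviance samples. It assumes the standard definition DIC = D̄ + p_D with p_D = D̄ − D(θ̄); the exact variant used in the article may differ. The function names `mse` and `dic` and all numbers are hypothetical toy values, not taken from the article.

```python
from statistics import fmean


def mse(observed, fitted):
    """Mean square error over all (age, year) cells of two equally shaped grids."""
    squared_errors = [
        (o - f) ** 2
        for row_obs, row_fit in zip(observed, fitted)
        for o, f in zip(row_obs, row_fit)
    ]
    return fmean(squared_errors)


def dic(deviance_samples, deviance_at_posterior_mean):
    """DIC under the assumed standard definition: mean deviance plus p_D."""
    mean_deviance = fmean(deviance_samples)
    # Effective number of parameters: mean deviance minus deviance at the posterior mean.
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d


# Toy log death rates (rows: age groups, columns: years); purely illustrative.
log_m_observed = [[-4.0, -4.1], [-3.0, -3.2]]
log_m_fitted = [[-4.1, -4.1], [-3.0, -3.0]]

fit_error = mse(log_m_observed, log_m_fitted)
criterion = dic([-100.0, -98.0, -102.0], deviance_at_posterior_mean=-103.0)
print(fit_error, criterion)
```

The same `mse` function applied to held-out years rather than in-sample years would give a prediction error in the spirit of the MSEP columns, with the forecasts taken from whichever forecasting distribution is being assessed.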
Table 5. Mean square error of the fit of the models to the data (MSE) and mean square errors of predictions using forecasting distributions given by MCMC samples (MSEP$_{MCMC}$) and Kalman Filter (MSEP$_{Kalman}$). The models highlighted in bold font exhibit significant levels of over-fitting.

| Model | MSE | MSEP$_{MCMC}$ | MSEP$_{Kalman}$ |
|---|---|---|---|
| LC$_{1\times1}$ | 0.0128 | 0.0577 | 0.0568 |
| LC$_{5\times1}$ | 0.0113 | 0.0457 | 0.0457 |
| LC$_{5\times1,\,adj}$ | 0.0116 | 0.0512 | 0.0516 |
| LCC$_{1\times1}$ | 0.0079 | 0.0249 | 0.0243 |
| LCC$_{5\times1}$ | 0.0097 | 0.1778 | 0.1774 |
| LCC$_{5\times1,\,adj}$ | 0.0099 | 0.1664 | 0.1625 |
| LCCF$_{1\times1}$ | 0.2588 | 0.4735 | 0.6150 |
| LCCF$_{5\times1}$ | 0.0107 | 0.0458 | 0.0464 |
| LCCF$_{5\times1,\,adj}$ | 0.0131 | 0.0481 | 0.0508 |

