A Square-Root Factor-Based Multi-Population Extension of the Mortality Laws

In this paper, we present and calibrate a multi-population stochastic mortality model based on latent square-root affine factors of the Cox-Ingersoll and Ross type. The model considers a generalization of the traditional actuarial mortality laws to a stochastic, multi-population and time-varying setting. We calibrate the model to fit the mortality dynamics of UK males and females over the last 50 years. We estimate the optimal states and model parameters using quasi-maximum likelihood techniques.


Introduction
Mortality laws are considered traditional actuarial models that describe the link between death probabilities (or the force of mortality) and the age of the individuals. However, it is very soon after continuous-time stochastic mortality models started developing that [1] proposed the use of latent factors to generalize mortality laws to a stochastic continuoustime setting. In the proposed framework, a set of time-varying stochastic components and parameters affected the evolution of mortality rates at different ages. In particular, [1] calibrated the one-year death probabilities of several ages of the Dutch population to 2or 3-factor Gaussian models that extended the first and second Makeham's and Thiele's laws to a stochastic setting. The authors of [2] built on a similar setup to propose a general multi-population stochastic extension of mortality laws. There, the mortality intensities of several populations were affine functions of affine latent factors. Different populations and sub-populations within them had different mortality dynamics because they depended on some common factors-shared among population groups-and certain idiosyncratic factors, which were population-specific. Population-specific parameters interact with such factors to reproduce the age-mortality relation. While the setting described in [2] was very general, the implementation was restricted to the use of Gaussian factors. The Gaussian factors have the advantage of being easier to calibrate. However, Gaussian factors have the disadvantage of allowing for negative mortality intensities with non-negative probabilities. That is a relevant shortcoming, for instance, for annuity pricing purposes. In their work, [2] described the application of a 4-population 3-factor law, they estimated through the use of a sequential Kalman filter. In this paper, we advance the previous work and focus on the case in which factors are non-Gaussian, which has not been previously undertaken in the actuarial literature. While several other works have used factor models to represent the evolution of mortality intensity of multiple ages simultaneously in continuous time (see [3,4], for instance), only a few, however, have considered non-Gaussian factors [5].
In contrast to that paper, which calibrated a model for a restricted portion of the mortality curve of the two-population case, our work tries to parsimoniously capture the evolution of mortality of the entire curve or a large fraction of the ages. Further, this work implements a two-population model that lies within the general framework described in [2] and considers factors modeled as square-root processes à la [6]. Finally, we estimate a stochastic multi-population version of the second Makeham's law and a modification of Thiele's law to fit the observed mortality death rates of United Kingdom females and males from 1967 to 2017 by exploiting the state-space representation of our model. To achieve that, we take a centralized data fusion approach, assuming all observations come from the same source while at the same time assuming Gaussian measurement errors. In addition, the nonnormality of the state process dynamics, which violates the assumptions under which the standard Kalman filter produces optimal state estimates and an exact log-likelihood for the model, was accounted for. Specifically, we followed [7] and modified the Kalman filter structure to introduce non-linearity and apply thus a Quasi-Maximum Likelihood approach. While producing a possibly conditionally biased estimate, the Kalman filter we implemented remains the optimal linear filter [8].
The paper is structured as follows. Section 2 draws a historical picture of mortality laws and provides some motivation for our work. Section 3 describes the modeling setup. Section 4 specifies the state-space form of our model. Finally, Section 5 presents our application of the stochastic multi-population version of the Second Makeham's and a 3-factor modified version of Thiele's mortality law to the UK male and female population.

Mortality Laws: Motivation
At an aggregate level, the most important covariate in explaining individual mortality is age. Indeed, mortality rates have an obvious relation with age. Historically, mortality rates have been high for infants due to the perils of labor and their fragility, increasing at older ages due to biological aging. Due to this persistent phenomenon, actuaries started describing the relation between mortality rates and age through mathematical models, starting from the XVIII century. These mathematical models are of the form: where µ x denotes the mortality intensity of an age x and ξ is a set of parameters. The most famous laws were introduced in the XIX century. In what follows, we give an account of these models. The Gompertz law describes the mortality intensity µ x at age x as a function of two parameters: It captures the phenomenon of so-called senescent mortality, since it assumes an exponential increment in the mortality intensity with age. Makeham laws, proposed in 1867 and 1890 respectively, have the following forms: where, together with the exponential term as in the Gompertz law, a baseline mortality level (A) and a linear term (Hx) are added. The Thiele law, dating back to 1871, aimed at capturing the whole mortality-age curve, by considering, together with the exponential term à la Gompertz, a term that tries to capture the spike of the curve at age 0 associated with infant mortality and a term to reproduce the so-called accident hump: In the last century, several other laws have been proposed. They combine additional terms and different functional forms. Other prominent examples are Heligman-Pollard's laws and the Perks laws, which identify a functional relation between the death rate q or the odds ratio q p (i.e., the ratio between the death rate and the survival probability p), and age. The interested reader may refer to the book [9] for a more detailed account.
The success of traditional mortality laws is due to their ability to capture parsimoniously in a deterministic fashion the relation between the mortality intensity and age. The functional forms used are able to capture the most important trends in such relation, such as the infant mortality pattern and the old-age increase in mortality rates. However, the age-patterns profiles have been changing in time. Thus, the fit of the mortality laws to the observed data of a certain population over different calendar years would produce different parameter estimates. To capture this issue and be able to describe mortality dynamics in a model, while maintaining the traditional actuarial approach of mortality laws, Schrager [1] suggested modeling some parameters in the mortality laws as stochastic factors. Indeed, the introduction of stochastic factors can capture the time trends in human mortality development and the uncertainty surrounding them. The authors of [2] extended that setup to describe the mortality intensity of multiple populations. In that framework, which will be reviewed more properly in the next section, the mortality intensity of individuals of different populations, at different ages and calendar years, is modeled as a function of population-specific parameters and factors. Such an extension allows to jointly describe the mortality intensity of several general populations or sub-groups (socio-economic groups), their evolution in time, and their uncertainty through traditional mortality laws.

The Model
In this section, we present a square-root factor-based approach to stochastic mortality modeling, building on the work in [2]. The mortality intensity of different ages of several populations is modeled parsimoniously as a function of some population-specific parameters and a set of latent factors described by a multivariate Cox, Ingersoll, and Ross (1985) process.

Mortality Intensity and Information
Our continuous-time modeling framework follows the approach first proposed by [10], in which the death process of an individual is described as the first jump time of a Poisson process with stochastic intensity. Mortality intensity is indeed such intensity. This modeling approach parallels the paradigm extensively used in credit risk modeling, in which firm default is similarly represented [11].
Let us introduce a probability space (Ω, F , (F t ) t≥0 , P), with P denoting a (historical) probability measure. The filtration {F t : 0 ≤ t ≤ T} satisfies the usual properties of rightcontinuity and completeness and contains the information on the mortality rates, coming from the historical life tables, contained in sub-filtration G t , as well as the knowledge on whether the individual is alive at time t, contained in the sub-filtration H t . Hence, We assume there exist J ∈ N + populations of interest, indexed by j ∈ J = {1, . . . , J}. Our interest is in modeling the evolution of the mortality intensities of sets of ages (cohorts of individuals sharing the same year of birth) belonging to these different populations. We define the set of ages of interest in population j as X j := {x j min , . . . , x j max }, where integer ages are ordered from the smallest, x min , to the biggest, x max . In principle, we can consider different sets of ages for different populations. Let us define as µ j x+t (t) the set of predictable processes that represent the time t mortality intensity of an age x + t belonging to the cohort aged x at time t.
The mortality intensity of individuals belonging to population j depends on a populationspecific set of parameters, γ j and on a vector of unobservable stochastic factors Y j (t) : The value of these factors represents the mortality trend level, which decreases and represents improvements to longevity due to progress in medicine, food availability, health habits, living conditions, and individual well-being.
We consider that all J populations are affected by the same innovations, described by a vector Z(t) of Brownian noise. However, responses to such changes may differ across populations because factors can be population-specific and because these factors influence population mortality due to the population-specific parameters. Formally, the mortality intensity at time t of an age x + t, belonging to population j, µ j x+t (t), is described by a functional of the vector of state processes Y j : Notice that the functional R j can in principle depend on the age x + t, the cohort x and calendar time t separately. We specify this functional further by assuming it has an affine dependence on the state-vector process. We define then two sets of population-specific functions g j 0 (x + t, t, x) and where · denotes the scalar product.

Stochastic Mortality Laws
We take advantage of the above setup to generalize the most common mortality laws we reviewed in Section 2: • Gompertz's law: In this case, In this case, Additionally, we mention a slight modification of Thiele's law, which was proposed by [2] and which proved to fit the evolution of mortality better at younger ages: We will fit this specification in our application. Notice that g 0 (x + t, t, x) = 0 for all the above specifications.

The State Process Vector
We model the J state processes by means of continuous-time stochastic processes, belonging to the affine class. We focus on purely diffusive square-root processes of the Cox-Ingersoll-Ross type: while Z(t) is a vector of possibly correlated Brownian motions. To summarize, our state process vector captures the time-trend of mortality levels. The different processes impact different slices of the mortality curve, i.e., representing different mortality indexes for different classes of ages. θ j denotes the vector of long-term values of such factors, while A j determines the speed at which the current value of the factor returns to this long-term mean. The diffusion coefficient depends on the current value of the process. Overall, these factors, when the elements of A j and θ j are positive, are always non-negative. As a consequence, the mortality intensities can be guaranteed to be positive, provided that the elements in vectors γ j are appropriately bounded. This choice avoids the issue of dealing with negative intensities with non-zero probability, which is instead a feature of all stochastic mortality models that assume Gaussian innovations.

Survival Probabilities
Because we are in an affine framework, we can compute the expectation of affine functionals of the state processes explicitly (see [12], for instance). In particular, we are able to obtain a closed-form expression for the survival probability T−t p j x+t (t, T), the probability that at calendar time t an age x + t belonging to population j will survive until time T > t. The survival probability is indeed a functional of µ j x+t (t; x): We can give the following Theorem (see [2,13]): Theorem 1. Under the above setup, the time-t survival probability of an individual aged x at t = 0, belonging to population group j, up to time T is where α j (x, t, T) and β j (x, t, T) solve the following system of ordinary differential equations: where α j (x, T, T) = 0 and β j (x, T, T) = 0.
We refer the reader to [2] for a proof of this rather standard result based on the affine-matching principle (see also [14]).

The State-Space Form of the Model
To estimate our model, we write it in state-space form. From now on, we consider observations collected at discrete-time (e.g., annual) intervals. Without loss of generality, we assume that standard Brownian motions drive the processes in the state vector process. Given expression (2), it is easy to notice that the logarithms of survival probabilities are linear functions of the state process vector. Therefore, we consider the opposite of this quantity, which is positive, as our measurement. We denote the observations with t ∈ {1, . . . , T}, and withq j (t) the elements of our observation vector, which are the opposite of the logarithm of the annual survival probabilities. We write: Equation (2) allows us to state that our observations are linear in the values of the factors. We still have to address how to treat the information coming from the life tables of the different populations. For simplicity, we adopt a centralized data fusion approach (see [15]). Hence, we gather all the information as part of a unique observational vector. We assume that each observation is a measurement of a quantity and the measurement entails some noise. We specify thus the following linear model: where j x (t) is the measurement noise. Following [1], we assume that the observation noise is Gaussian with mean zero and standard deviation proportional to the observation itself, i.e., Var[ This assumption makes sense in our context because when age increases, the cohort size decreases. Observations relative to older ages, for whichq x 's are larger, are likely to be less accurate than observations of younger ages. Notice that we allow for different measurement noises strengths k j for each group to deal with the fact that measurements can come from different sources. We further assume that measurement errors are uncorrelated across ages and populations. Coupling this description of our data with the description of the state processes given in Section 3, our model has the following state-space representation: where Y(t) is the M-dimensional vector of state process values that gathers all the state processes influencing the J populations of interest,Q(t) and (t) are the vectors that collect all the observations and corresponding measurement errors, respectively. R(t) is a diagonal matrix, with elements s 2 i (t) on the diagonal. The coefficients collected in α(t) and β(t) are functions of both the parameters of the state-vector process and of the parameters of the mortality law. Y(t) is the vector of state processes, while η(t + 1) collects its noise. Since Y(t) is a vector of square-root diffusions, the noises are not Gaussian: the transition density of each element of Y follows a non-central χ 2 distribution. In this context, the Kalman filter, which is the traditional tool used to estimate Gaussian linear state-space models, is still the optimal linear filter (see [8]). However, although our measurement equation is linear, the standard Kalman filter estimates may be conditionally biased, because innovations in the state processes are not Gaussian (see, for instance, [16]). Several authors have discussed this issue and provided alternative solutions to this problem. We follow [7], who modified the standard Kalman filter to estimate the multi-factor Cox-Ingersoll-Ross model via a quasi-linear Kalman filter. Such estimator introduces modifications to the Kalman filter: • it captures the dependence of the conditional variance on the current value of the state; • it puts a non-negativity constraint on the value of the state process.
Indeed, the modified Kalman filter approximates the state-process transition density to a Gaussian density with the same conditional first two moments and introduces a non-linearity through the non-negativity constraint. The resulting quasi-maximum likelihood is then maximized. In implementing the Kalman filter recursion, we adopt a sequential formulation (see [8,17]) which allows us to avoid matrix inversion. Indeed, we could have followed Monte Carlo approaches as in [18]. In particular, particle filtering was used in [19] in the context of interest rate modeling with CIR-type models. However, given the dimensionality of our problem, which includes several factors, a large number of observations and parameters to estimate, we preferred to apply Quasi-Maximum Likelihood methods, which turned out to be less computationally intensive. Our approach is justified in light of the evidence provided by several works [7,20,21] regarding the accuracy of the method for affine interest rate term-structure models.

Description of the Quasi-Linear Kalman Filter Algorithm
The Quasi-Linear Kalman Filter algorithm, as mentioned above, allows us to obtain an estimate of the conditional state process vector mean and variance covariance matrix, given the observations:Ŷ Additionally, it allows us to compute an approximate log-likelihood for our model, given the parameters. An optimization procedure then will explore the parameter space to maximize such log-likelihood. We report here the algorithm that implements the sequential quasi-linear Kalman filter equations for our model. Let us first denote our matrices as where p = ∑ j X j is the total number of observations considered for each calendar year, q i (t)-with a slight abuse of notation-denotes the i-th observation, where i ranges over the set of integers which uniquely define the elements of the couple (x, j). β i (t)'s are row vectors. From now on, we also suppress dependence on x and x + t for notational convenience. Since R is a diagonal matrix, the equation that describes the model for each single observation of interest is The sequential Kalman filter algorithm runs then as follows: 1. Given a certain parameter set ξ, the filter is initialized using the unconditional mean and variance of the state-process vector,

2.
For each time step t, the state process vector and its variance covariance matrix are , and Γ(t + 1) is the state-process conditional variance/covariance matrix. Notice that such variance/covariance matrix depends onŶ(t), differently relative to the standard Gaussian case.

3.
The estimates are initialized asŶ and then updated after each observation i = 1, ..., p, setting a floor to the value of the estimated mean to zero:Ŷ where the measurement error prediction v i (t) and the Kalman gain K i (t) are respectively: .
Note that v i (t) and F i (t) are scalars and K i (t) is a N-dimensional column vector.

4.
Finally, when i = p, we setŶ and we compute

5.
Using these last expressions for F i (t) and v i (t) we can compute the log-likelihood function: which has to be maximized to obtain the optimal parameter set.

Log-Likelihood Maximisation
Log-likelihood maximization is performed using a global differential evolution algorithm [22]. With this algorithm, several sets of parameters (a population) are initially generated. Then, for each of these sets (individuals), the log-likelihood is computed. The best member of the population, i.e., the parameter set that provides the highest loglikelihood, survives to the next iteration. The other members may survive with a certain probability, but their genes (single parameters) can mutate, i.e., change their value to a different one, randomly picked. The process of population generation and mutation continues until a selected number of iterations is performed, or the best population member is unchanged after a given number of iterations.

Results
Since the UK annuity market represents the largest annuity market in the world, the UK's mortality data often represent a point of departure for the estimation of mortality models. This section applies our model to estimate two stochastic mortality laws to the UK males and females population. The UK aggregate mortality data are sourced from Scotland, England and Wales and Northern Ireland. While we do not consider the regional mortality experiences separately, our modelling framework could encompass the case where region-specific parameters are accounted for, together with different measurement errors across regions. Specifically, our data sample is composed of one-year survival probabilities for ages 0-89 from 1967 to 2017, taken from the Human Mortality Database (www.mortality.org, accessed on 7 December 2021 ).

First Makeham's Multi-Population Law
We estimate the following model, where j = 1, 2 denote the male and female UK population, respectively: In this case, g j (x + t, t, x) := [1, (γ j ) x+t ], with Y ∈ R 2 . The model has two factors, which are the same across the populations. Indeed, we dropped dependence on j above. The coefficients α(·) and β(·) j = [β j 1 , β j 2 ] can thus be obtained solving the following ODE system: with final conditions α j (x, T, T) = 0 and β j (x, T, T) = 0. While affected by the same factors, they respond differently to the second factor. It is the only parameter γ j that shapes the mortality-age relation for the different populations. Indeed, though very parsimonious, this specification is not rich enough to capture well the complexity of the whole mortality curve. We restrict our application to older ages only, ranging from 50 to 89. Table 1 reports the estimated parameters. The population-specific parameter, which captures the steepness of the exponential increase in mortality at old ages captured by Makeham's mortality law is not unexpectedly higher for UK males (0.1153) than for females (0.0944). Table 1. This table collects the optimal parameters for our model. For each population, the "common" column collects the parameters relative to the factor processes and the measurement variancecovariance matrix s. The other two columns collect the parameters relative to the response functions of the two populations of males and females.

Maximized
Number  Figure 1 displays the evolution of the values of the latent factors in time. The first factor, which acts as a baseline mortality intensity level, is estimated to decline rather smoothly over time. Notice that its initial value is higher than the long-term estimated average θ 1 (0.0014). This parameter can be interpreted as the estimated biological limit which the value of the factor can tend to. The speed of mean reversion a 1 (0.0233) can be interpreted as the speed at which the mortality baseline level tends to converge to such long-term value. The second factor, which interacts with age to reproduce the exponential increase of mortality at older ages, also declines. Its volatility is higher because it influences mortality at older ages, reflecting the well-known fact that higher uncertainty surrounds older age mortality improvements. The current values of the factors (i.e., the estimates for the last calendar year) are still far from their long-term averages. Taking into account their exponential decay, we expect the current values (4 × 10 −3 and 6 × 10 −6 respectively) to halve in approximately 173 years, and 115,524 years respectively, and thus still be above their long-term average. The model produces a good fit of the mortality surface, despite its very simple form. Figure 2 reports the Mean Absolute Relative Errors of the single female and male ages (averaged over calendar years): , and of each calendar year (averaged over all the observations): whereq denotes the fitted one-year death probability. The model performance is good, given that 4080 observations are fitted with only nine parameters. Average MARE is 7.22%, and it is lower than 10% for almost all ages, apart from older males. This is in line with the two-factor model performance in [1], despite our two-population fit. Only older male figures seem to be more problematic to fit. Performance may be improved assuming that one state process (for instance, the first factor) differs across the two populations.

Modified Thiele's Law
We estimate the following model, where j = 1, 2 denote the male and female UK population, respectively: In this case, g j (x + t, t, x) := [e −γ j 1 (x+t) with final conditions α j (x, T, T) = 0 and β j (x, T, T) = 0. Table 2 reports the estimated parameters. The factors have a constantly decreasing trend in time, apart from the second factor, which decreases over the last 15 calendar years only as depicted by Figure 3 (see Table1). The first and the third factors decline very fast until the 2000s, to depict the extraordinary improvements in infant mortality and mortality at older ages. As for the Makeham's law case, the factors are still way above their long-term averages θ and thus we expect further improvements of mortality in the long-run. Such long-term averages are very important for long-run projections of mortality rates. In unreported estimations, we test the changes of the values of these parameters to slight changes in the estimation time window, finding reasonable stability. Table 2. This table collects the optimal parameters for our model and is divided in two panels, one for each population. For each population, the "common" column collects the parameters relative to the factor processes and the measurement variance covariance matrix s. The other two columns collect the parameters relative to the response functions of the two sub-populations of males and females.

Maximized
Number   In our estimation procedure, we have not constrained the value of the parameter that controls the age which is mostly affected by the second factor. Thus, the factor affects the different populations differently. For example, it influences female mortality at young adult ages to help to address the accident hump (γ f 4 = 23.1866). For males, it instead adds up to the third factor in influencing mortality at older ages (γ m 4 = 88.2853). As Figure 4 shows, the model's fit is very good for both female and male populations, with an average MARE across all observations of 11.47%. The first factor helps capture the infant mortality spike at age 0. Mortality at young adult ages and at older ages is slightly more difficult to capture for males, for which MARE is above the average between ages 18 and 23. Between ages 40 and 70, the curve's fit is better for males than females.

Conclusions
In this paper, we have focused on a stochastic and multi-population extension of traditional actuarial mortality laws. We place ourselves within the setup introduced by [2]. The mortality intensities of all (or a large set of) the ages of multiple populations are modeled within the affine framework as functions of population-specific parameters, which shape the age dependence and some unobservable factors that describe the evolution of mortality trends over time. Here, these latent factors are modeled as square-root processes and describe the evolution of mortality trends over time. The estimation procedure for such models entails dealing with a non-Gaussian state-space model. In this case, estimation via Kalman filter still provides us with the optimal linear state estimator. However, this may be sub-optimal due to the non-linearities introduced by the state-vector process density. We followed [7] and applied a Quasi-Linear Kalman Filter algorithm that approximates the state process transition density to a Gaussian density truncated at 0 with the same conditional first two moments. Although approximate, the procedure was proven accurate enough in a similar setup when fitting multi-factor CIR models for interest rates. Relative to previous similar multi-population models, the mortality intensities are thus guaranteed to be non-negative. This is indeed an advantage relative to Gaussian models, especially when the model needs to be applied to forecast intensities over long time periods. Obviously, this can be important when dealing with the pricing of long-term liabilities, such as annuities, or with risk management for pension funds. Inspired by the Makeham and the Thiele laws, our application provides the calibration of two stochastic multi-population mortality laws to the UK populations of males and females mortality rates jointly from 1968 to 2017. The models are parsimonious and are shown to provide a good fit of the mortality surface over time.