A Multistate Analysis of Policyholder Behaviour in Life Insurance—Lasso-Based Modelling Approaches

Reck, Lucas; Schupp, Johannes; Reuß, Andreas

doi:10.3390/risks13040073

Open AccessArticle

A Multistate Analysis of Policyholder Behaviour in Life Insurance—Lasso-Based Modelling Approaches

by

Lucas Reck

^1,2,*

,

Johannes Schupp

¹

and

Andreas Reuß

¹

Institute for Finance and Actuarial Sciences (ifa), 89081 Ulm, Germany

²

Institute of Insurance Science, Ulm University, 89081 Ulm, Germany

^*

Author to whom correspondence should be addressed.

Risks 2025, 13(4), 73; https://doi.org/10.3390/risks13040073

Submission received: 6 December 2024 / Revised: 31 March 2025 / Accepted: 3 April 2025 / Published: 9 April 2025

(This article belongs to the Special Issue Statistical Models for Insurance)

Download

Browse Figures

Versions Notes

Abstract

Holders of life insurance policies can exercise various options that lead to contract modifications, e.g., full surrender, partial surrender, and paid-up and dynamic premium increase options. Transitions between these contract states materially affect (current and future) cash flows and thus represent a serious source of uncertainty for an insurance company. It is common practice to determine best-estimate assumptions for these transitions independently, i.e., without considering joint determinants of the different aspects of policyholder behaviour. The recent literature also incorporates multistate classical statistical models. Our paper shows how consistent best-estimate transition rates for multiple status transitions can be derived using data science methods. More specifically, we extend existing multivariate approaches based on established statistical models (generalised linear models) with the Lasso method, such that the key drivers for each transition can be identified automatically. We discuss the performance, the complexity and the practical applicability of the different modelling approaches based on data from a European insurer.

Keywords:

multistate; multi-class; lapse rate; paid-up; life insurance; Lasso

1. Introduction

1.1. Motivation and Practical Relevance

A crucial part of the risk management of a life insurance company is the proper modelling of future policyholder behaviour. Contract modifications for existing policies are an important part of this customer behaviour, and there are several different legal and contractual options for the policyholder, e.g., full surrender, partial surrender, and paid-up and dynamic premium increase options. The resulting various contract states and, in particular, the transition between states have a material impact on the cash flow profile of the insurance company. This poses a serious risk, as the cash flows have a direct impact on asset liability management. Consequently, under the European regulatory framework Solvency II, all contractual options and the factors affecting the exercise of these options need to be taken into account in the best estimate valuation, and the risk related to all legal and contractual policyholder rights has to be assessed in a separate risk sub-module denoted as ‘lapse risk’ (see articles 32 and 142 in EU 2015). In practice, independent models for each policyholder option are built with the Whittaker–Henderson approach (a univariate smoothing algorithm; see, for example, SAV (2018)) or generalised linear models (GLMs; see, for example, Haberman and Renshaw (1996)). However, modelling these risks separately can lead to false interpretations and bad management decisions. We, therefore, advocate for a holistic modelling of policyholder options that allows for a consistent prediction of future cash flows. Furthermore, the choice of the ‘correct’ variables for the Whittaker–Henderson approach and the GLM is both manual (and, therefore, subjective) and time-consuming since a potential new covariate (or interaction) requires a full readjustment of the previously considered covariates.

In this paper, we show how holistic policyholder behaviour models can be set up efficiently. The models can be used to derive best-estimate transition rates for multiple contract states (e.g., active, paid-up, lapse, etc.). We show how the transition rates can be estimated for each customer in each contractual year, given their respective policy information (e.g., age, sum insured, etc.) and transition history (e.g., already transitioned to the paid-up state). Of course, there are different multivariate modelling approaches with different levels of manual interventions. In particular, data analytics techniques such as the Lasso can be used to replace the manual process of variable selection with a data-driven approach. This also enables us to compare the different multivariate modelling approaches objectively and fairly. The joint modelling approach also addresses modelling issues due to low data volume for policyholder options that are less frequently exercised. The major advantage of the Lasso is its interpretability: unlike other machine learning approaches like gradient-boosting machines (GBMs) or neural networks, the Lasso remains fully interpretable and individual model parameters can be explained. Depending on the use case, the interpretability of the model is essential.

Building a multi-state model for policyholder behaviour has two core dimensions of complexity. First, the overall model choice: several approaches in the field of survival analysis, (generalised) linear models and other machine learning areas are available. We choose to focus on GLM-based models using extended Lasso penalties. With that, we can model complex policyholder behaviour while retaining parsimony and interpretability. But even within the Lasso-based models, there are numerous ways of allowing for multiple states. Second, the inclusion of the transition history in the model: there are different ways of including information about previous states of a life insurance contract that may impact the probability of future transitions.

The practical implications of this analysis are mainly relevant to the actuarial and risk management department. We present innovative tools to derive best-estimate transition rates for multiple policyholder options, which is a mandatory task under Solvency II. This means that a previously very time-consuming step can be significantly improved (with respect to performance and workload). Other departments (e.g., the marketing department) might also benefit from the results of such a model, e.g., by establishing a lapse prevention strategy or by designing new products.

1.2. Literature Review and Contribution

In the actuarial literature, binary-lapse behaviour for insurance companies has been analysed thoroughly. Most research focuses on macroeconomic variables (e.g., interest rates or unemployment rates). They analyse hypotheses like the interest rate hypotheses or the emerging fund hypothesis. See, for example, Kiesenbauer (2012) for the German market or Poufinas and Michaelide (2018) for the Greek market. Due to the confidentiality of policyholder data, there is limited research on the effect of policy-specific variables (e.g., contract duration or sum insured) on lapse behaviour. Refer to Eling and Kochanski (2013) for an overview of both macroeconomic and policy-specific research on lapses in life insurance.

The main tools used to analyse lapse behaviour on a policyholder level are survival analysis, see Milhaud and Dutang (2018), and generalised linear models (GLMs); see Barucci et al. (2020) and Eling and Kiesenbauer (2014). There are also machine learning approaches to analyse lapse behaviour. Refer to Reck et al. (2023), Azzone et al. (2022), and Xong and Kang (2019) for an extended Lasso approach, a random forest or a neural network and a support vector machine, respectively.

In these papers, lapse is the only state (in addition to the active state), meaning that the response is treated as a binary variable. Now, we allow for more states and address multi-state policyholder behaviour (with lapse being one possible state). Modelling several states and transitions between these states is not an entirely new topic in actuarial science and was already discussed by Gatenby and Ward (1994). In fact, this type of analysis is common in certain areas, especially in health insurance; e.g., with possible states active, disabled, and dead. Of course, the specific states and possible transitions depend on the insurance product; see Christiansen (2012).

There are also some multi-state applications in life insurance: L. Zhang (2016) uses a Markov process to model different fitness states (and transitions between them) and their effect on mortality. Kwon and Jones (2008) analyse mortality rates for potentially changing states of socio-economic factors (e.g., income or smoking). Milhaud and Dutang (2018) analyse lapse behaviour with multiple possible states using a competing risk approach (survival analysis). Finally, Dong et al. (2022) analyse customer churn using multinomial logistic regression (MLR) and a second-order Markov assumption. They also compare the MLR with a binary one-versus-all model, as well as a gradient boosting machine and a support vector machine.

In this paper, we contribute to the actuarial literature via the implementation of different modelling approaches based on extended versions of the Lasso. These incorporate various penalty types which can be used to automatically generate a parsimonious and competitive model. The benefits of the Lasso are transferred to various policyholder options, e.g., lapse or paid up. This includes a discussion of the architecture of the different approaches, the corresponding aggregation schemes applied to get the final predictions for the transition rates, and different orders of the Markov assumption. We compare the different modelling approaches and Markov assumptions quantitatively and qualitatively.

1.3. Structure of the Paper

The remainder of this paper is organised as follows: In Section 2, we introduce different modelling approaches for the underlying multi-state problem and discuss the qualitative features of each approach. We also present different ways of including the transition history in each modelling approach. Section 3 introduces the data set and other details of the implementation. It supplements the previous section by adding quantitative aspects of the different modelling approaches. In Section 4, we show the numerical results and compare the different modelling approaches and ways of including the transition history. Finally, Section 5 concludes the work.

2. Modelling Multiple Status Transitions

In this section, we present different approaches to modelling multiple status transitions and focus on qualitative aspects of the model selection: we differentiate with respect to the structure, uniqueness, complexity of the model and the possibility to generalise to an arbitrary number of transitions. We also analyse and compare approaches to considering the history of an insurance contract in a model. This section focuses on the architecture of the different modelling approaches, while the specific implementation and application for our data set are described in Section 3 and compared in Section 4.

In our analysis of a typical insurance portfolio, only annual status transitions are tracked. Therefore, the initial status at the beginning of a year and the status at the end of the year are recorded, and exactly one potential transition during the year can be further analysed and considered in a multi-class problem. In the literature, there are several model structures for multi-class response variables where an estimated probability for each potential class is derived, and there are basically two ways of modelling such a multi-class problem: Firstly, decomposition strategies transform the original multi-class problem into several binary problems and subsequently combine them to obtain a multi-class model. Those approaches are particularly interesting for machine learning models like support vector machines, which were originally designed for binary problems. In the GLM framework, the binary model corresponds to logistic regression. An overview of decomposition strategies is given in Lorena et al. (2008). The decomposition strategies used for this analysis are described and discussed in Section 2.1, Section 2.2 and Section 2.3. Secondly, models with a holistic strategy can handle multiple classes directly with no need for decomposition. In the GLM framework, this corresponds to an MLR, as described in Frees (2004). This approach is described in Section 2.4.

In the analysis of policyholder behaviour, the transition history may impact future transition probabilities, e.g., the lapse probability may be increased for a contract that was paid up recently compared to a contract that has been paid up for a longer time. Therefore, different approaches of including the history in the model are discussed in Section 2.5. These approaches can be applied to both the decomposition approaches and the holistic approach.

For recurring terms, we use the following notation:

K is the set of possible classes for the response variable Y with $m = | K |$ potential classes.
Based on Allwein et al. (2000), M corresponds to a coding matrix with possible entries, $m_{i, j} \in {- 1, 0, 1}$ . Each column, j, corresponds to a binary base model, indicating whether the class (in row i) has a positive label ( $m_{i, j} = 1$ ), has a negative label ( $m_{i, j} = - 1$ ) or is not included ( $m_{i, j} = 0$ ). The latter means that data from class i are not reflected in the calibration of model j.
$p_{I} : = P (Y \in I | x)$ describes the (predicted) probability that an observation, x, is in the subset of classes I.
$x^{J}$ denotes the subset of the observations, where $y \in J$ .
$p_{I}^{J} : = P (Y \in I | x^{J})$ .
In general, p describes a (predicted) probability from a binary model, i.e., before aggregation, and q describes a (predicted) probability for a multi-class model, i.e., after aggregation.

Hypothetical example:We illustrate each model with a hypothetical data set with three classes (

K = {A, B, C}

and

m = 3

) for the response variable

Y \in K

and two not-further-specified covariates,

X_{1}

and

X_{2}

(adopted from Z. Zhang et al. 2016); see Figure 1. Each observation,

(x_{i}, y_{i})

, in this example is depicted in the two-dimensional plane (covariates), where each class (response) is visualised using a different colour and symbol.

2.1. One vs. All Model

The one vs. all (OVA) model (also called one against all (OAA)) is a popular choice among the decomposition strategies. As the name suggests, the OVA approach builds several binary models; each models one class versus all the other classes:

M_{OVA} = (\begin{matrix} 1 & - 1 & - 1 \\ - 1 & 1 & - 1 \\ - 1 & - 1 & 1 \end{matrix}) .

(1)

Figure 2 shows the model architecture for the hypothetical example. The black lines represent binary model results separating the two relevant classes using the two covariates.

An aggregation scheme is necessary to transform the probabilities from the binary models into a multinomial model with probabilities for each class1. Note that the models themselves are unbiased in the sense that the average predicted value equals the average observed value. Therefore, the average predicted values of the individual models add up to one. However, this is not necessarily valid for a single observation, and in general, the sum of the predicted probabilities is not equal to one here. An intuitive approach for the aggregation is a rescaling of the individual probabilities, such that the original ratios are conserved, but the probabilities also add up to one for each single observation:

q_{i} = \frac{p_{i}}{\sum_{i} p_{i}}

(2)

This aggregation scheme is also statistically motivated since the rescaled probabilities,

q_{i}

, are closest (in terms of the Kullback–Leibler distance; see Kullback and Leibler 1951) to the individual probabilities,

p_{i}

, while adding up to one:

\begin{matrix} \min_{q} \sum_{i} p_{i} log \frac{p_{i}}{q_{i}}, s . t . \sum_{i} q_{i} = 1 . \end{matrix}

(3)

There are also other possible aggregation schemes.

For the OVA model with the rescaling aggregation scheme, there are m binary models that have a unique definition and are independent of each other. However, the aggregation scheme implies a dependency on the actual target, q. Thus, a prediction for a specific class may impact or even worsen the predictions for other classes.

2.2. One-vs.-One Model

The one-vs. -one (OVO) model (also called one-against-one (OAO)) is another popular choice among the decomposition strategies. As the name suggests, the OVO approach builds several binary models; each models one class versus another class:

M_{OVO} = (\begin{matrix} 1 & 1 & 0 \\ - 1 & 0 & 1 \\ 0 & - 1 & - 1 \end{matrix}) .

(4)

Figure 3 shows the model architecture for the hypothetical example. Again, an aggregation scheme is necessary to obtain a multinomial model with probabilities for each class2. There are several possibilities on how to aggregate the probabilities from the individual binary models,

p_{i}^{J}

, in order to estimate the probability distribution of the underlying data set,

q_{i}

; see Galar et al. (2011). Also, note that the individual models in the OVO approach can ‘contradict’ each other in the sense that there is no compatible set of probabilities,

q_{i}

; e.g., if

p_{A}^{A, B} = 0.9, p_{B}^{B, C} = 0.8

and

p_{A}^{A, C} = 0.2

. In this analysis, we apply pairwise coupling, as proposed by Hastie and Tibshirani (1997) for the aggregation. The idea is to minimise the (weighted) sum of the Kullback-Leibler distances between

p_{i}^{i, j}

and

μ_{i, j} = \frac{q_{i}}{q_{i} + q_{j}}

, where the weights

w_{i, j}

correspond to the number of observations in

x^{i, j}

:

\min_{q} \sum_{i < j} w_{i, j} [p_{i}^{i, j} log \frac{p_{i}^{i, j}}{μ_{i, j}} + p_{j}^{i, j} log \frac{p_{j}^{i, j}}{μ_{j, i}}]

(5)

For the example above, where

p_{A}^{A, B} = 0.9, p_{B}^{B, C} = 0.8

and

p_{A}^{A, C} = 0.2

, the proposed aggregation scheme (with

w_{i, j} = 1 \forall i, j

) gives

q_{A} = 0.38, q_{B} = 0.29

and

q_{C} = 0.33

. Since the individual models would imply that A is more likely than B, and that B is more likely than C, but then that C is more likely than A, the aggregated probabilities around

\frac{1}{3}

seem reasonable. The order

q_{A} > q_{C} > q_{B}

appears plausible as well since the combined estimated probabilities from the individual models are

1.1 > 1.0 > 0.9

for classes

A, C

and B.

An alternative and intuitive aggregation scheme for the OVO approach is the rescaling of the combined estimated probabilities, such that the resulting probabilities add up to one. In the example above, this would lead to

q_{A} = \frac{1.1}{3.0} = 0.37, q_{B} = \frac{0.9}{3.0} = 0.30

and

q_{C} = \frac{1.0}{3.0} = 0.33

. Even though the resulting probabilities (for this example) are very similar to those from the pairwise coupling, the alternative aggregation scheme performed significantly worse in the comparisons, as laid out in Section 4. Therefore, we decided to omit this alternative aggregation scheme and focus on pairwise coupling even though the interpretability of the OVO approach suffers from the rather complicated aggregation scheme.

The number of necessary binary models increases rapidly with the cardinality of K. In general, there are

\frac{m (m - 1)}{2}

models in the case of m classes for the response variable, and the individual models are independent for the sub-samples. As for the OVA, the aggregation scheme implies a dependency on the actual target, q.

2.3. Nested Model

Nested models (also called hierarchical models) are another example for combining binary models to arrive at a multinomial model. OVA and OVO are symmetric in the sense that changing the order of the classes does not impact the result, i.e., these models have one unique definition. Within the nested approach, there are multiple ways of defining the hierarchical model setup. For example, in the case of three classes, there are three possible model architectures:

M_{Nested A} = (\begin{matrix} 1 & 0 \\ - 1 & 1 \\ - 1 & - 1 \end{matrix}), M_{Nested B} = (\begin{matrix} - 1 & 1 \\ 1 & 0 \\ - 1 & - 1 \end{matrix}), M_{Nested C} = (\begin{matrix} - 1 & 1 \\ - 1 & - 1 \\ 1 & 0 \end{matrix}) .

(6)

Figure 4 shows the first model architecture (Nested A) for the hypothetical example.

In general, there are many separation possibilities; see Equation (8) below. In each step of the hierarchy, the remaining classes are split into two (potentially imbalanced) parts until, finally, each class is separated. Therefore, each individual class, k, has subsets,

j_{i}

, describing the unique separation path of length s:

K = j_{0} \supset j_{1} \supset \dots \supset, j_{s - 1} \supset j_{s} = k

.

The hierarchical design of the aggregation scheme ensures well-defined probabilities, and thus, an aggregation scheme is not necessary for the nested approaches. To obtain the prediction of a class, we use the specific path for that class:

q_{k} = \prod_{i = 0}^{s - 1} p_{j_{i + 1}}^{j_{i}} .

(7)

The specific order is rather arbitrary and has a major impact on the complexity of the prediction,

q_{k}

. We will see in Section 3 that this order also affects the quality of the model fit tremendously. This dependency on the order is a major disadvantage of the nested models. In some applications, expert judgement can help identify a reasonable order. In many applications, however, the best possible order is a priori unknown.

In general, there are

m - 1

models in the case of m classes for the response variable, as for the OVO approach. However, this time, the order matters. The models are not independent since, in each step of the hierarchy, the models are conditioned on the result of the higher levels of the hierarchy. We have seen in the hypothetical example that, for three classes, only three different orders are possible. But, with four classes, there are 15 possible orders, and with five classes, they are already 105. To obtain a general formula for the number of possible orders, we derived a recursive formula based on combinatorial terms and then transformed it into the following compact representation using the gamma function:

f (m) = \frac{2^{m - 1} Γ (m - \frac{1}{2})}{\sqrt{π}},

(8)

where

f (m)

is the number of possible orders for m final classes.

Thus, there is potentially a huge number of possible model specifications and a high complexity introduced via the number of binary models.

2.4. Multinomial Model

Instead of decomposing the multi-class problem into several binary problems (e.g., using the OVA, OVO, or nested methods), we can also approach the multinomial problem directly. Within the GLM framework, the most intuitive way is to use an MLR, which is a direct generalisation of the logistic regression; see Frees (2004). Figure 5 shows the model architecture for the hypothetical example. Also, outside of the GLM framework, there are several algorithms capable of modelling multiple classes, e.g., neural networks with a softmax layer or random forests.

The MLR estimates all probabilities simultaneously. Therefore, we do not need to transform individual probabilities using an aggregation scheme as before. The black line in Figure 5 represents a hypothetical model to separate the classes—this time, three classes simultaneously, instead of just two classes at a time.

This approach only uses one model, no matter how many states the response variable has. Hence, it uses all information simultaneously for the prediction of an observation. The model structure is unique; therefore the model imposes a significant complexity reduction.

2.5. Transition History

There are many applications—including ours—where observations are made over time. In our data set (see Section 3.1), this corresponds to a yearly observation for each in-force contract. At the end of each complete contract year, the current state is tracked (it can be active, paid-up, or lapse). Therefore, we do not differentiate between, for example, lapse after 4.5 contract years or 4.8 contract years. The remaining covariates do not change over time. This results in a sequence of possible states and transitions for each observation. The transition history of an observation may impact the probabilities of future transitions, e.g., the lapse probability may be increased for a contract that was paid up recently compared to a contract that has been paid up for a longer time. In order to improve estimates of transition probabilities, it may be useful to consider the past of an observation in the model. There are several ways to include the transition history in the model, and we focus on the following three approaches:

No previous information: There is always the possibility of ignoring any information about previous states. This is the easiest and most primitive way of dealing with the transition history but, it can still be legitimate for applications where the history is obviously irrelevant.
Markov property: A Markov property can be assumed, see Dynkin (1965). The ‘past’ (transition history) does not matter for the ‘future’ (predictions), given that the ‘present’ (current state) is known, i.e.,:

$P (Y_{t} = y_{t} | Y_{t - 1} = y_{t - 1}, Y_{t - 2} = y_{t - 2}, \dots, Y_{1} = y_{1}) = P (Y_{t} = y_{t} | Y_{t - 1} = y_{t - 1}),$

(9)

where t indicates the time (=contract duration) of an observation and $Y_{t}$ corresponds to the state (active, paid-up, or lapse) in that year, t. Hence, we are modelling yearly lapse and paid-up rates.
Full transition history: There are also applications in which it is possible to define one new covariate (or more), which represents the state history sufficiently. This highly depends on the number of states and the structure and dependencies of the underlying data set. In our specific example, the time since paying up seems to sufficiently describe the state history; see Section 3.1 for details.

In the second and third approaches, there is still flexibility in how to include the information in the model. Let the status history be sufficiently described in one (or more) new covariates,

X_{h i s t o r y}

. Subsequently, it can be treated as a normal covariate in the model. Alternatively, the covariate and all its interactions with the remaining covariates can be included in the model. This allows the model to identify more complex structures in the data. A third way is splitting the data according to that covariate and building separate models for each subset. Note that this increases the number of individual models for each model architecture. Including the covariate in the model formula is no longer necessary since it is just constant for each subset. However, in cases where the history can only be captured sufficiently using many new covariates, or if the number of possible classes is high, this third approach may not be feasible.

All above-mentioned approaches are analysed for our data set in Section 4.

3. Application for a European Life Insurer

In this section, we further specify, apply, and compare the approaches to modelling multiple status transitions for a portfolio of life insurance contracts provided by a pan-European Insurer operating in four countries. An elaborate description of the considered models can be found in McCullagh and Nelder (1989) for GLMs, in R. Tibshirani (1996) for the Lasso, in R. Tibshirani et al. (2005) for the fused Lasso, and in Kim et al. (2009) for the trend filtering. Also, see Reck et al. (2023) for the application of the extended versions of the Lasso to a logistic lapse model.

3.1. Data Description

The data set contains a large number of insurance contracts over an observation period of around 21 years. This particular portfolio went into run-off after about 11 years, which means that no new contracts were written after that. Nevertheless, the insurer is required to fulfil the regulatory requirements (and the estimation of the best estimate transition rates are essential for that) until all contracts have expired. Also, the insurer does not aim to transfer the business to an active portfolio, as it is specialised in run-off business. The fact that this portfolio went into run-off does not affect the proposed modelling approaches—they can also be applied to portfolios that are not in run-off. However, this should be kept in mind when interpreting the coefficients in Appendix A. The data set comes from a single source, is well structured, and does not contain any missing values. It is treated confidentially, as it contains customer and company-related information.

Life insurance contracts typically include several options available to policyholders, such as full surrender (lapse) and the stopping of regular premium payments (paid-up), but they also include the option for the reinstatement of premium payments after paying up, partial surrender, pre-defined dynamic premium increases and other premium increases, or the payment of top-up premiums. For the contracts in this particular portfolio, the key observed status transitions of the contracts are the option of paying up a contract and the exercise of a full surrender (lapse) option. Thus, the data set appears well suited to applying the presented models using three possible states: active (A), paid-up (P), and lapse (L). Note that, for other portfolios, significantly more transitions may need to be modelled.

In this paper, the option to ‘lapse’ is defined to comprise surrender (the policyholder cancels the contract and gets the surrender value), pure lapse (the insurance contract is terminated without a surrender value payment), and transfer (the policyholder cancels the contract and transfers the surrender value to another insurance company). The second option, ‘paid-up’, is defined by a reduction in the regular premium payments to zero, but the contract remains in force. Obviously, an execution of the first option implies a transition to a terminal status. The paid-up option leaves a potential lapse option open in the future.

There are n = 1,070,139 observations (167,659 unique contracts) with an extended set of up to

J = 15

covariates: contract duration (the number of years between inception and observation time), insurance type (traditional or unit-linked), country (four European countries), gender, payment frequency (e.g., monthly or annually), payment method (e.g., debit advice or depositor), nationality (whether or not the country, in which the insurance was sold equals the nationality of the policyholder), the dynamic premium increase percentage, entry age, the original term of the contract, the premium payment duration, the sum insured, and the yearly premium. Other specifications about the insurance type (other than traditional or unit-linked) are not available in our data set and could, therefore, not be included in the model calibration. A finer distinction (e.g., term life, whole life, the existence of riders, etc.) can provide further interesting insights. Moreover, options like top-ups and premium holidays are not material for this run-off portfolio, as they are rarely observed. This could be material for other life insurance portfolios.

We extend this original set of covariates by including two covariates that contain information about the previous state (or states) of a policyholder. For some models, we only consider the previous status (i.e., active or paid-up, since lapse is the terminal status). For other models, we also add another covariate, time since paying up, which counts the years between being paid up and the observation time. It is defined as 0 as long as a policyholder is still active (of course, each policyholder initially starts in the active state). For this data set, there are no observations with reinstatement, meaning that we do not observe the transition P → A. Other covariates do not change over time and are, therefore, deterministic for our life insurance portfolio. Hence, the covariates’ contract duration and the time since paying up uniquely determine the full state history of a contract3.

For modelling purposes, a separate row is created for each in-force contract in each observation year. As a consequence, one single contract may occur in several rows in the data set—once for each observation year when the contract is still in force. This differs from a standard survival analysis setup, where we would typically have one observation per contract. This observation would then contain information about the duration (until being paid up and until lapsing), potential censoring, and the final state. This modification leads to a data set where the rows are no longer fully independent, and we also have a selection bias (contracts being in force longer get more weight than contracts lapsing early). Given the size of our data set, this modification seems justifiable. However, in general, the effect of this modelling approach should be analysed carefully.

Figure 6 shows the decreasing exposure (upper part) and the composition of the three states (lower part) for different values of contract duration. The one-year lapse rates clearly decrease rapidly in the first three years, which is consistent with the well-known higher lapse rate at the beginning of an insurance contract. The one-year paid-up rate shows a different trend: starting with a very small percentage for the first year of the contract, the rate increases over time until it reaches a certain threshold. Although the rates have different trends, they should be modelled consistently together.

The calendar year might contain valuable information about policyholder behaviour, e.g., with higher lapse rates during a financial crisis (emerging fund hypothesis). To quantify the calendar year effect, we add a new covariate in a sensitivity analysis for one of the qualitatively and quantitatively best models. The results are discussed and evaluated in Section 4.2. For future transition rates, however, the calendar year effect is unknown a priori, and therefore, further assumptions are required to derive a best estimate.

Other macroeconomic indicators for a specific calendar year, like the swap rate, inflation rate, or stock performance, or demographic indicators for a specific country, like the social security system, might contain helpful information for many portfolios. However, since previous research suggests that they turn out to be not significant on this particular data set (see Reck et al. 2023), we refrain from a deeper evaluation in this analysis.

One useful pre-processing step in this framework is the binning of continuous covariates. With that, the corresponding covariate has several pre-defined category levels (bins). Without binning, the model can either estimate a single parameter for the continuous covariate (and potentially underfit) or estimate a parameter for every single value of the covariate by treating it as a factor (and potentially overfit). With binning, we are, therefore, able to derive an interpretable model with good predictive power to satisfy the requirements of an insurance company.

There are different possibilities for choosing adequate bins, e.g., bucket binning (each bin has the same length), quantile binning (each bin has the same number of observations), or simply relying on expert knowledge. There are also more complex ways to define bins: Henckaerts et al. (2018) use evolutionary trees as described in Grubinger et al. (2014) to estimate optimal bins for continuous variables. These evolutionary trees incorporate genetic algorithms into the classical tree framework to find the global optimum by also allowing changes in previous splits. We choose a rather simple data-driven approach (no expert knowledge) to derive the bins and follow the approach of Reck et al. (2023) using a univariate decision tree for each continuous covariate in the data set: the entry age, the original term of the contract, the premium payment duration, the sum insured, and the yearly premium. To avoid overfitting, we use small trees with at least 5% of the observations in each terminal leaf (see again Reck et al. 2023). Note that contract duration is not assumed to be continuous, and no univariate decision tree is built. The category levels are, therefore, just the natural numbers up to 20.

3.2. General Model Setup

Since we are looking for a parsimonious and interpretable model, we focus on GLM-based models using extended versions of the Lasso in our analysis. The probabilities are defined as follows:

\begin{matrix} P (Y = 1 | x) = \frac{e^{x^{T} β}}{1 + e^{x^{T} β}}, \\ P (Y = k | x) = \frac{e^{x^{T} β_{k}}}{\sum_{l = 1}^{m} e^{x^{T} β_{l}}}, \end{matrix}

(10)

where Y is the target variable, x is the covariate vector, and

β

is the model parameter vector. The first equation corresponds to the logistic regression (the binary case of the decomposition strategies), and the second equation corresponds to the MLR with m classes. The logistic regression is a special case of the MLR with

m = 2

. However, in Equation (10), the reference level for the logistic regression is implicitly set to one of the two classes, while there is no explicit reference level for the MLR. Consequently, we only consider those coefficients not corresponding to class A in the results to make the comparison of the number of coefficients in the models fair.

For the logistic regression, we have the following log-likelihood function:

log Likelihood (β | x, y) = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} (x_{i}^{T} β) - log (1 + e^{x_{i}^{T} β})) .

(11)

The parameters

β

are estimated with a penalised maximum likelihood optimisation (regularisation). We use the methodology proposed by Reck et al. (2023):

- log Likelihood {(β | x, y)}_{L a s s o} = - log Likelihood (β | x, y) + λ \sum_{j = 1}^{J} g_{P e n_{j}} (β_{j}),

(12)

where

λ

controls the penalisation strength (

λ = 0

corresponds to a regular GLM without penalisation, and higher values for

λ

correspond to a higher penalisation and, therefore, fewer parameters), and

P e n_{j}

represents different versions of the Lasso, i.e., a regular Lasso, a fused Lasso, and trend filtering:

\begin{matrix} g_{R} (β_{j}) & = & \sum_{i = 1}^{p_{j}} | β_{j, i} | = : \sum_{i = 1}^{p_{j}} | β_{j, i}^{R} |, \\ g_{F} (β_{j}) & = & | β_{j, 1} | + \sum_{i = 2}^{p_{j}} | β_{j, i} - β_{j, i - 1} | = : \sum_{i = 1}^{p_{j}} | β_{j, i}^{F} |, \\ g_{T} (β_{j}) & = & | β_{j, 1} | + | β_{j, 2} - 2 β_{j, 1} | + \sum_{i = 3}^{p_{j}} | β_{j, i} - 2 β_{j, i - 1} + β_{j, i - 2} | = : \sum_{i = 1}^{p_{j}} | β_{j, i}^{T} | . \end{matrix}

(13)

The regular Lasso penalises the difference of each category level to the intercept (

| β_{j} |

) and can be used for covariates without ordinal scale. The fused Lasso penalises the difference between two adjacent category levels (

| β_{j} - β_{j - 1} |

) and is, therefore, suitable for fusing neighbouring category levels. Finally, the trend filtering penalises the difference in the linear trend between category levels (

| β_{j} - 2 β_{j - 1} + β_{j - 2} |

). It is, hence, used to model a piecewise linear and often monotone structure within that covariate. For trend filtering, we start with the middle category level as the intercept, as this category has a material exposure in any case. For the MLR, the equations can be adjusted according to the following:

log Likelihood (β | x, y) = \frac{1}{n} \sum_{i = 1}^{n} [\sum_{l = 1}^{m} (y_{i, l} (x_{i}^{T} β_{l})) - log (\sum_{l = 1}^{m} e^{x_{i}^{T} β_{l}})],

(14)

and

- log Likelihood {(β | x, y)}_{L a s s o} = - log Likelihood (β | x, y) + λ \sum_{j = 1}^{J} \sum_{l = 1}^{m} g_{P e n_{j, l}} (β_{j, l}) .

(15)

These extended versions of the Lasso allow the model to capture structures within covariates—however, only for the effect on a specific value of the response variable. A penalisation across different values of the response variable is not possible in this modelling setup. For example, two different coefficients for the effect of a covariate, j, on lapse,

β_{j_{1}, L}

and

β_{j_{2}, L}

, may be fused together. Similarly, the two different coefficients for the effect of covariate j on paid-up

β_{j_{1}, P}

and

β_{j_{2}, P}

may be fused together. However, a fusing between

β_{j_{1}, L}

and

β_{j_{1}, P}

or between

β_{j_{2}, L}

and

β_{j_{2}, P}

is not possible. To our knowledge, this form of optimisation has not been implemented yet.

For the implementation, we essentially use the same setup as described in Reck et al. (2023), which can be summarised by

Using the R Core Team (2022) interface for the h2o library—see LeDell et al. (2022);
Modelling the trend and fused Lasso penalty with contrast matrices;
Determining the hyper-parameter $λ$ based on a 5-fold cross-validation using the one standard error (1-se) rule4.

3.3. Specific Model Setup and Parameter Estimation

For the decomposition strategies, Equation (12) can be applied for each binary model independently, based on the coding matrices, M. This results in the calibration of three independent binary models (see

M_{OVA}

and

M_{OVO}

) or two independent binary models (see

M_{Nested A}

,

M_{Nested B}

and

M_{Nested C}

). Therefore, we perform three (two) independent penalised maximum likelihood estimations using Equation (12) with corresponding

λ^{i}

and

β^{i}

. The calibration of the binary models is independent, and therefore, the values of penalisation terms differ in general; see Table 1. In contrast, the MLR only requires a single model calibration using Equation (15).

Table 1 shows the main findings with respect to the penalisation term

λ

:

The second model of each nested approach is identical to a model in the OVO approach. Of course, this can also be seen when comparing the columns of $M_{O V O}$ in Section 2.2 with $M_{Nested A}, M_{Nested B}$ and $M_{Nested C}$ in Section 2.3.
Splitting the data set requires an additional model which is identical for all approaches (cf. the last entry in the last column). For the subset with the initial state ‘active’, the number of models is identical to the previous number of models (including both initial states) because the corresponding response variable can still have all three states, ‘active’, ‘paid-up’, and ‘lapse’. For the subset with the initial state ‘paid-up’, however, one additional model is required with possible levels ‘paid-up’ and ‘lapse’ for the response variable. It is just a single logistic regression with an initial state, ‘paid-up’, and response, ‘paid-up’ or ‘lapse’.
Whenever a model distinguishes class P from one (or all) other classes, the corresponding $λ$ value is rather high—especially when P and L are compared (see Nested A, second model). A plausible interpretation might be that separating class P is comparably easy for a model in the sense that the model performance does not decrease when the penalisation is increased.
The decomposition strategies have a higher degree of freedom in terms of the $λ$ value because they might differ for the individual binary models. In this application, however, the $λ$ values seem to have a similar magnitude across the different modelling approaches. Note that we also optimised the penalised likelihood functions from the decomposition strategies with the restriction of a constant penalisation term, $λ^{i}$ , for all binary models. As expected, the impact on the results was rather small. This might be different in applications where the independently calibrated values vary more. In the end, we chose the penalisation terms of Table 1, which is consistent with the independent model definitions.

In principle,

P e n_{j, l}

may also differ for the individual binary models within a decomposition strategy. This again leads to a high degree of flexibility; e.g., if

P e n_{entry age, A}

= ‘trend filtering’ and

P e n_{entry age, P}

= ‘fused’. However, this flexibility did not seem to impact the result significantly, such that we chose the same penalty type for all models, as described in Table 2. The different penalty types allow the model to identify different structures within covariates, as described in Section 3.2. The choice of the appropriate penalty term, therefore, highly depends on the structure of the corresponding covariate. For more details and the mathematical matrix representation, we refer to Reck et al. (2023) or R. J. Tibshirani and Taylor (2011).

Finally, the estimated coefficients are fed into Equation (10). For the decomposition strategies, the resulting probabilities are then transformed using the aggregation schemes presented in Section 2. For the MLR, the resulting probabilities are directly well defined and do not require further modifications.

4. Results and Comparison of the Modelling Approaches

After we have described the different modelling approaches, as well as the calibration to the underlying data set, we now analyse and compare the results. We focus on two dimensions: the different model architectures (see Section 2.1, Section 2.2, Section 2.3 and Section 2.4) and the inclusion of the transition history (see Section 2.5). We have seen that, due to their architecture, some of the models require considerable effort and readjustment to generate multinomial probabilities. Therefore, we also include the complexity (number of models and number of parameters) and the computing time of the models as additional components that are obviously important for the model selection.

Table 3 shows the two dimensions of the analysis: the rows show different model architectures, and the columns show different ways of including the transition history. Reck et al. (2023) also analyse further perspectives in a sensitivity analysis. They find that the Lasso approach offers advantages with respect to performance compared to univariate approaches like Whittaker–Henderson. It also has advantages with respect to the model complexity (measured with the number of parameters) without a loss in performance compared to a GLM. The latter is also true for ridge regression and the elastic net. We expect similar results for this analysis.

Furthermore, the extended versions of the Lasso (regular, fused, and trend filtering) allow for differentiated modelling of structures within covariates without adding too many parameters to the model. Hence, the selective property of the Lasso omits bins or category levels from the model. The choice of

λ

using the 1-se rule also supports the selection of a robust model with competitive performance.

The table is split into three parts: the upper part shows the performance of the models. The performance measure is defined as

1 - \frac{D_{m}}{D_{0}}

, where

D_{m}

corresponds to the deviance of the model, and

D_{0}

corresponds to the deviance of the intercept-only model (or null model). Therefore, it can be interpreted as the relative improvement over the intercept-only model. This measure is similar to the

R^{2}

measure for normally distributed response variables, with a similar interpretation. Since the measure is based on deviance, it is also consistent with the likelihood optimisation described in Section 3.2. Other performance measures (e.g., the multi-class area under the curve (AUC) as defined by Hand and Till 2001) show a similar pattern. The middle part of the table shows the number of models, the number of parameters, and the number of potential parameters. An entry,

a, b / c

, thus means that a individual models were built to get the overall prediction, and a total of b parameters were selected via the underlying Lasso out of c possible parameters from the underlying data set (or data sets). Parameters where the Lasso assigns a value of zero are not included in this entry b, as they implicitly disappear from the model. The lower part of the table shows the computing time (in minutes) on a standard computer. The aggregation scheme for the OVA model is very simple, so there is no measurable effort here. For the OVO model, however, the aggregation effort is considerable. For this model, the aggregation effort is given in parentheses.

Figure 7 visualises a part of the information from Table 3. The different models (except the intercept-only model) are visualised by colours, and the different transition histories are visualised by shapes. The x-axis shows the number of parameters, and the y-axis shows the relative improvement over the intercept-only model.

4.1. Transition History

Markov property (vs. no previous information): We can clearly see in the figure (or in the Markov property column) that the previous state contains valuable information. All models perform significantly better with the previous state compared to the corresponding model without any previous information (between 8 and 17 percentage points). This is intuitive since the fact that a contract was already paid up before impacts the chance of it being paid up in the current year. Since only one potential coefficient is added to the model (

β_{p r e v i o u s s t a t u s}

), the number of parameters does not change a lot (all changes within plus and minus 15%) and for most of the models even decreases, e.g., from 112 to 103 for Nested L. This is an indicator that the additional covariate

β_{p r e v i o u s s t a t u s}

contains valuable information such that other covariates are then no longer required in the model. Of course, this highlights one of the major advantages of using the Lasso approach: it selects the covariates automatically.

Full transition history (vs. Markov property): The full transition history, including the time since paying up, does not seem to improve the model significantly. Even though it increases the number of parameters (e.g., by 31% for the Nested L), the deviance is equal to or only marginally better than the deviance of the model with only the previous state (all changes less than half a percentage point). This holds true for all models. Therefore, the time since paying up does not seem to add value to the model - as long as the previous state is included.

Markov property with interaction (vs. Markov property): When using the previous state, including its interaction terms, the performance is somewhat better than that of the corresponding model without interactions (up to 2.2 percentage points). This illustrates the selective property of the Lasso: Models with these interaction terms are able to recognise different structures for the impact of a covariate on the target variable, depending on the initial state. However, the number of parameters increases significantly (e.g., almost doubles for the MLR). This is not surprising, as an interaction term is included for every covariate, i.e., the number of potential parameters is essentially doubled (the previous state can be A or P). If the number of initial states increases further, the number of interaction terms increases accordingly.

Markov property with splitting (vs. Markov property): Using the previous state by splitting the data set also seems to perform somewhat better than the corresponding model using the previous state as a covariate (up to 2.1 percentage points). The number of parameters seems to be on a similar level (some are higher; some are lower). Note that there is always one additional model when splitting the data set. As described above, this additional model is exactly the same for all modelling approaches.

Markov property with splitting (vs. Markov property with interaction): In terms of model performance, there is no material difference between splitting of the data set and allowance for interactions (a decrease by 0.1 to 0.2 percentage points). However, the models based on splitting the data set have much fewer parameters (between c. 20% and 40%).

Qualitative comparison: We analysed and compared different alternatives for the inclusion of the transition history of a contract. Increasing the number of potential states, the first one (adding the previous state as a covariate) increases the complexity of the model only marginally. The second one (adding the previous state and its interaction terms) can also be generalised to more potential states—however, this increases the number of parameters significantly. The third one (splitting the data set) requires further splits and might not be feasible for many more states—especially when some states only show a few observations. The inclusion of the ‘full transition history’ in a single covariate is only possible for our specific example. In general, with an increasing number of states, several covariates are necessary to replicate the full transition history.

4.2. Modelling Approaches

Quantitative comparison: Overall, the different models show similar performances, e.g., in the range of 46.6–51.4% when including the previous state as a covariate. The OVO approach achieves the best deviance across different ways of including the transition history. Within the nested models, the order seems to play an important role for the performance. Without further empirical knowledge, Nested A may seem like a good choice, as the majority class is separated in the first step (actives vs. paid-up/lapse). However, this model performs the worst among all models. To find the best nested model, all possible orders have to be analysed, which can be very time-consuming. In this case, nested L performs the best among the nested models (first separating lapse from active/paid-up). The performance of the OVA, nested P, and MLR are similar in terms of deviance.

Figure 8 shows the predictions of the models (the black line) using a similar format as the lower part of Figure 6. All modelling approaches show a similar shape for the predicted lapse rate, i.e., a strongly decreasing trend for the first two contract years, followed by a constant or slightly increasing trend until year 11 before eventually decreasing again until year 20. The shapes of the predicted paid-up rates are also similar for the different models. The multivariate predictions (with respect to contract duration) are consistent for all approaches.

For the number of parameters, the MLR offers an advantage over the other approaches. After that, the nested models follow. OVO and especially OVA have the most parameters.

The computing time shows that the MLR is comparatively slow (it takes about twice as long as a nested case). The penalised maximum likelihood optimisation is based on a multinomial distribution here, compared to binomial distributions for the other approaches. Presumably, the numerical optimisation is more complex for MLR. The OVO model is a clear outlier due to the time-consuming aggregation that is much more expensive than the actual calibration of the binary models. Note that this aggregation must also be applied for each individual case so that the effort is also incurred in the application of the model and not only in the calibration. If the number of classes increases, we would expect a similar computing time for the multinomial model. For the decomposition models, however, the computing time is expected to increase rapidly.

Qualitative comparison: We now compare the qualitative aspects of the models: the complexity and interpretability of the model architecture, as well as the ability to generalise to more potential states. We also show the marginal effect of the models with respect to the covariate contract duration; see Figure 9. For that, we use the estimated coefficients for contract duration (and the intercept) in each individual model (and set all other coefficients to zero) to predict the individual probabilities. Then, we use the same aggregation scheme as for the overall prediction. Especially the OVO aggregation changes the interpretability of the individual probabilities significantly. The corresponding line, therefore, only partly reflects the overall model behaviour for the OVO. In general, the shape is similar for the models with a decreasing trend for the first three years, followed by a slightly increasing trend until year six. After that, there is a constant (or only slightly decreasing) trend until year eleven, followed by a clear trend change resulting in a decreasing trend until the end. The Lasso approach, therefore, decreases the number of parameters significantly by grouping certain category levels. For example, instead of the six original category levels 6, 7, 8, 9, 10, and 11 for contract duration, we might only need to consider one group 6–11 (depending on the modelling approach). Since we used the trend filtering for contract duration, one group follows one linear trend and between different groups are then trend changes.

It is also noteworthy that the marginal effect of the OVO approach is almost identical to the marginal effect of the Nested P model. This is due to the fact that the first and third models in the OVO approach, as well as the first model in the nested P approach, hardly use the contract duration as a predictor. Instead, they focus on other covariates (like

β_{p r e v i o u s s t a t u s}

), which are set to 0 for the marginal plot and are, therefore, omitted in the prediction. Thus, the two lines appear to be congruent. The OVA and OVO approaches have a rather simple architecture and are still easy to generalise. The number of required models is

O (m)

for the OVA approach and

O (m^{2})

for the OVO approach. Both approaches require an aggregation scheme to obtain the final probability for each class. The OVA aggregation scheme basically builds a weighted sum of the individual models and, therefore, remains fully interpretable. The OVO aggregation scheme is rather complex, such that there is no direct and interpretable connection between the final prediction and the individual predictions. This is a major disadvantage of the OVO approach.

The nested models have a more complex architecture. The order of the classes is critical here, which makes this approach unfeasible for situations with a higher number of classes. This can be seen in Figure 9, as the marginal effects for the nested approaches differ significantly. Therefore, the generalisation to more potential states is not trivial. The number of possible orders is

O (2^{m} m!)

, and the number of required models for each order is

O (m)

. Although the aggregation scheme seems rather intuitive, the multiplication of models complicates the model interpretability. Thus, the nested approach has a qualitative disadvantage in terms of the loss of generalisation and interpretability.

The MLR offers the most qualitative advantages: It is a single model (i.e.,

O (1)

) and is, therefore, easy to set up. It does not require any aggregation, and coefficients can be explained and interpreted directly. The model has the ability to include more states without the need for further models.

The upper parts of the tables in Appendix A contain the estimated model parameters of the MLR, including the previous state as a covariate. We see the main advantages of the Lasso:

The model is easy to interpret, such that the important drivers for different policyholder options can be identified and evaluated. E.g., the lapse rate parameters, as shown in Figure 9 (marginal multinomial), can be found in Table A2 with a negative parameter in year 11 (corresponding to the negative trend) and a positive parameter in year 15 (corresponding to the weakened negative trend).

The model is robust to data specifics; e.g., the parameter for the transition to the active state is extremely low when the previous state is paid up (see Table A15). This is intuitive for our portfolio since we do not observe any reinstatements. A similar effect can be seen for the paid-up premium payment frequency. The intercept category is a single premium for each state. However, a single-premium policy cannot transfer to the paid-up state. This is basically offset by the positive coefficient in the annual premium frequency category (see Table A6).

The model is flexible, as a further covariate (calendar year) can be integrated without much effort. We show the effect of the covariate in a sensitivity analysis in the lower part of Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9, Table A10, Table A11, Table A12, Table A13, Table A14, Table A15 and Table A16 in Appendix A. The vast majority of the estimated model parameters are very robust with respect to the added covariable and do not change significantly. Also, the model performance remains almost unchanged (deviance improvement: 48.5 vs. 48.2). The effect of controlling for this covariate would, therefore, be very small. We can see the calendar year effect in Table A16, e.g., with increased paid-up rates in the years 2004 and 2005, which can be explained by regulatory changes. Interestingly, the global financial crises in 2007 and 2008 seem to have had little effect on the transition rates in our analysis. An important topic for risk management is the extrapolation to the future. When the calendar year is used as a covariate, it requires further assumptions to extrapolate the calendar year effect before the model can be applied. The model without the calendar year as a covariate can extrapolate future years immanently by extrapolating the recent trend for the covariate contract duration.

5. Conclusions

In the previous sections, we derived and compared quantitative and qualitative aspects for the modelling of multiple transition probabilities. Among the analysed models, OVO and nested L showed the best performance in terms of model deviance improvement. For MLR, OVA, and Nested P, the deviance was only slightly higher (with mostly no material differences between the models). The Nested A model performed significantly worse. In terms of the number of parameters, the MLR showed clear advantages.

We also analysed and compared different ways of incorporating the transition history in the models. In general, the information contained in the transition history of an insurance contract should be considered, as it improved the predictions for all models. Including the full transition history did not further improve the model, and the previous status seemed to contain all necessary information. Assuming the Markov property and including interaction terms with the previous state performed best and also achieved higher flexibility than the separation of the data set or the consideration as a simple covariate.

Although the models showed comparable quantitative results, they differed significantly in several qualitative aspects: The OVA/OVO and MLR can be generalised (in the sense of adding further classes/status transitions) and remain interpretable—especially the MLR, as it only consists of a single model. The OVO model lacks a clear interpretation in the aggregation step. Due to the many different ways of setting up the nested architecture, the nested modelling approach is more difficult to generalise. Overall, the MLR achieves clear qualitative advantages since no aggregation scheme is required, and no further individual models are needed if further classes are added.

In a model-selection process, qualitative and quantitative criteria should always be considered carefully. Depending on the application, the importance of one or the other varies. This analysis should, therefore, be of interest to anyone who wants to consistently model multiple transition probabilities.

Our analysis points to several fields for further research and is also limited to the data under consideration, e.g., a finer distinction of the insurance type (e.g., term life, whole life, the existence of riders, differences in riders, etc.), as well as the inclusion of options (e.g., top-ups and premium holidays), can be relevant for other portfolios and should be analysed if the data are available. It is, therefore, not possible to transfer the estimated parameters directly to other portfolios. Another example is that the model flexibility of the MLR is still rigid, as penalisation across different values of the response variable is not possible yet. However, a fusing between parameters for different values of the response variable, e.g.,

β_{a g e 1, L}

and

β_{a g e 1, P}

, may further increase the accuracy of the MLR.

Author Contributions

Methodology, L.R. and J.S.; software, L.R. and J.S.; validation, J.S. and A.R.; formal analysis, L.R.; data curation, L.R.; writing—original draft preparation, L.R.; writing—review and editing, J.S. and A.R.; visualization, L.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Restrictions apply to the dataset. It was provided by a life insurer for research purposes. The dataset is not publicly accessible as it contains sensitive company-related data.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

We use the following notation throughout the paper:

Notation/Abbreviation	Explanation
GLM	Generalised linear model
MLR	Multinomial logistic regression
GBM	Gradient boosting machine
Lasso	Least absolute shrinkage and selection operator
OVA	One versus all model
OVO	One versus one model
Y	Response variable (dependent variable)
X	Covariate matrix (independent variables)
K	Set of possible classes
m	Number of classes, i.e., $m = \| K \|$
n	Number of observations
J	Number of covariates
$β$	Model parameter vector
$λ$	Hyperparameter in the Lasso model controlling the penalisation strength
A	Active state
P	Paid-up state
L	Lapse state

Appendix A. Results of the MLR

Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9, Table A10, Table A11, Table A12, Table A13, Table A14, Table A15 and Table A16 contain the estimated model parameters

β

for the MLR model. Note that the individual model parameters refer to

β^{R}

,

β^{F}

and

β^{T}

from Equation (13). Hence, depending on the penalty type of the covariate as defined in Table 2, the individual values need to be transformed to derive the plain

β

value. The tables also contain a sensitivity analysis when the calendar year is used as an additional covariate.

Table A1. Result of the MLR for the intercept. Values for

β