# Application of Machine Learning to Mortality Modeling and Forecasting


## Abstract


## 1. Introduction

## 2. The Model

- ${D}_{\mathbf{x}}$ are independent in $\mathbf{x}=(a,t)$;
- ${D}_{\mathbf{x}}\sim \mathrm{Pois}({m}_{\mathbf{x}}\cdot {E}_{\mathbf{x}})$ for all $\mathbf{x}=(a,t)$.

- ${m}_{\mathbf{x}}={m}_{\mathbf{x}}^{\mathrm{mdl}}$
- ${D}_{\mathbf{x}}\sim \mathrm{Pois}({\psi}_{\mathbf{x}}\cdot {d}_{\mathbf{x}}^{\mathrm{mdl}})$, with ${\psi}_{\mathbf{x}}\equiv 1$ and ${d}_{\mathbf{x}}^{\mathrm{mdl}}={m}_{\mathbf{x}}^{\mathrm{mdl}}\,{E}_{\mathbf{x}}$
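In practice, the response variable supplied to the ML algorithms is the ratio of observed to model-implied deaths. A minimal Python sketch of how these targets could be formed (the arrays below are illustrative, not the paper's data):

```python
import numpy as np

# Illustrative observed deaths D_x and central exposures E_x on a small age-year grid
D = np.array([[120.0, 115.0], [98.0, 96.0]])
E = np.array([[10000.0, 9800.0], [9500.0, 9400.0]])

# Hypothetical model-implied central death rates m_x^mdl
m_mdl = np.array([[0.0115, 0.0113], [0.0102, 0.0101]])

d_mdl = m_mdl * E   # model-implied deaths: d_x^mdl = m_x^mdl * E_x
psi = D / d_mdl     # psi_x = D_x / d_x^mdl; equal to 1 wherever the model is exact

print(psi)
```

The ML algorithms then learn $\psi$ as a function of $\mathbf{x}=(a,t)$, and the learned $\widehat{\psi}_{\mathbf{x}}$ rescales the model-implied deaths.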

- Decision tree
- Random forest
- Gradient boosting

#### 2.1. Decision Trees

The DT algorithm is implemented through the R package `rpart` (Therneau and Atkinson 2017). The algorithm provides the estimate $\widehat{\psi}(\mathbf{x})$ given by the average of the response variable values $\psi (\mathbf{x})$ belonging to the same region identified by the regression tree. The values of the complexity parameter ($cp$) for the decision trees are chosen so as to make the number of splits uniform across the models considered.
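The paper works in R; as a rough Python analogue (scikit-learn's `DecisionTreeRegressor`, where the cost-complexity parameter `ccp_alpha` plays a role similar to `rpart`'s `cp`, though the two are scaled differently), the leaf-average prediction can be seen directly on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic (age, year) grid and psi responses; purely illustrative
age, year = np.meshgrid(np.arange(0, 101), np.arange(1990, 2015))
X = np.column_stack([age.ravel(), year.ravel()])
psi = 1.0 + 0.05 * np.sin(age.ravel() / 10.0) + rng.normal(0.0, 0.01, X.shape[0])

# Larger ccp_alpha prunes harder, i.e., fewer splits (the role cp plays in rpart)
tree = DecisionTreeRegressor(ccp_alpha=1e-4, random_state=0).fit(X, psi)

# Each prediction is the mean response of the region (leaf) the point falls into
psi_hat = tree.predict(X)
print(psi_hat.shape[0])
```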

#### 2.2. Random Forest

The RF algorithm is implemented through the R package `randomForest` (Liaw 2018). Since this procedure proved to be computationally very costly, the number of trees must be carefully chosen: it should not be too large, but at the same time large enough to produce an adequate percentage of explained variance and a low mean of squared residuals (MSR).
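A hedged Python sketch of this trade-off (scikit-learn's `RandomForestRegressor` standing in for the R package; the data are synthetic and the out-of-bag $R^2$ serves as a proxy for the percentage of variance explained):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic (age, year) -> psi data, as a stand-in for the actual response surface
age, year = np.meshgrid(np.arange(0, 101), np.arange(1990, 2015))
X = np.column_stack([age.ravel(), year.ravel()])
psi = 1.0 + 0.05 * np.sin(age.ravel() / 10.0) + rng.normal(0.0, 0.01, X.shape[0])

# A moderate number of trees keeps the cost down; oob_score_ is the out-of-bag
# R^2, a proxy for the percentage of variance explained
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, psi)

msr = float(np.mean((psi - rf.predict(X)) ** 2))  # mean of squared residuals
print(rf.oob_score_ > 0.0, msr < np.var(psi))
```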

#### 2.3. Gradient Boosting

The GB algorithm is implemented through the R package `gbm` (Ridgeway 2007). The `gbm` package requires choosing the number of trees ($n.trees$) and other key parameters such as the number of cross-validation folds ($cv.folds$), the depth of each tree involved in the estimate ($interaction.depth$), and the learning rate parameter ($shrinkage$). The number of trees, representing the number of GB iterations, must be chosen with care: a high number reduces the error on the training set but risks overfitting, while a low number may result in underfitting. The number of cross-validation folds should be chosen according to the dataset size. In general, five-fold cross-validation, which leaves $20\%$ of the data for testing, is considered a good choice in many cases. Finally, the interaction depth represents the highest level of variable interactions allowed, i.e., the maximum number of nodes for each tree.
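An illustrative Python counterpart of this setup (scikit-learn's `GradientBoostingRegressor` on synthetic data, with `max_depth` standing in for `interaction.depth` and `learning_rate` for `shrinkage`; the two libraries define these parameters slightly differently):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
age, year = np.meshgrid(np.arange(0, 101), np.arange(1990, 2015))
X = np.column_stack([age.ravel(), year.ravel()])
psi = 1.0 + 0.05 * np.sin(age.ravel() / 10.0) + rng.normal(0.0, 0.01, X.shape[0])

# A small learning rate needs many trees; max_depth caps variable interactions
gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.01,
                               max_depth=6, random_state=0)

# Five folds: each fold holds out ~20% of the data for testing
scores = cross_val_score(gb, X, psi, cv=5, scoring="r2")
print(scores.shape)
```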

## 3. Mortality Models

- ${\alpha}_{a}$: age-specific parameter providing the average age profile of mortality;
- ${\beta}_{a}^{(i)}\xb7{\kappa}_{t}^{(i)}$, $\forall i$: age-period terms describing the mortality trends (${\kappa}_{t}^{(i)}$ is the time index, and ${\beta}_{a}^{(i)}$ modifies the effect of ${\kappa}_{t}^{(i)}$ across ages);
- ${\beta}_{a}^{(0)}\xb7{\gamma}_{t-a}$: represents the cohort effect, where ${\gamma}_{t-a}$ is the cohort parameter and ${\beta}_{a}^{(0)}$ modifies its effect across ages ($c=t-a$ is the year of birth).
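Combining the three building blocks above gives the general predictor common to this family of models (standard notation, written out for completeness):

```latex
\log m_{a,t} \,=\, \alpha_a \,+\, \sum_{i=1}^{N} \beta_a^{(i)}\,\kappa_t^{(i)} \,+\, \beta_a^{(0)}\,\gamma_{t-a},
\qquad c = t - a .
```

The LC model takes $N=1$ with no cohort term; the RH model adds the cohort effect; the Plat model uses multiple period indices with predefined age modulations.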

#### 3.1. Lee-Carter Model

#### 3.2. Renshaw-Haberman Model

#### 3.3. Plat Model

## 4. Numerical Analysis

#### 4.1. Model Fitting

The models were fitted using the `StMoMo` package provided by Villegas et al. (2015). The following sets of gender $\mathcal{G}$, ages $\mathcal{A}$, years $\mathcal{T}$, and cohorts $\mathcal{C}$ were considered in the analysis:

#### 4.2. Model Fitting Improved by Machine Learning

The estimators ${\widehat{\psi}}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{DT}}$, ${\widehat{\psi}}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{RF}}$, and ${\widehat{\psi}}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{GB}}$ were obtained with the `rpart`, `randomForest`, and `gbm` packages, respectively:

- ${\widehat{\psi}}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{DT}}$ was estimated with the `rpart` package by setting $cp$ = 0.003 (complexity parameter);
- ${\widehat{\psi}}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{RF}}$ was estimated with the `randomForest` package by setting $ntrees$ = 200 (number of trees). Since this procedure proved computationally very costly, we limited the number of trees to 200 in order to guarantee both an adequate percentage of variance explained by the model and a low mean of squared residuals, MSR (see Table 2);
- ${\widehat{\psi}}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{GB}}$ was estimated with the `gbm` package by setting $n.trees$ = 5000 (number of trees), $cv.folds$ = 5 (number of cross-validation folds), $interaction.depth$ = 6, and $shrinkage$ = 0.001 (learning rate), chosen with regard to the algorithm's running time. The parameter $cv.folds$ is used to estimate the optimal number of iterations through the function $gbm.perf$ (see Figure 2).
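`gbm.perf` picks the boosting iteration that minimizes the estimated prediction error. A hedged Python analogue of that idea, using a held-out validation set and scikit-learn's `staged_predict` rather than `gbm`'s internal CV machinery (synthetic data, smaller parameter values than the paper's to keep the sketch fast):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
age, year = np.meshgrid(np.arange(0, 101), np.arange(1990, 2015))
X = np.column_stack([age.ravel(), year.ravel()])
psi = 1.0 + 0.05 * np.sin(age.ravel() / 10.0) + rng.normal(0.0, 0.01, X.shape[0])

X_tr, X_va, y_tr, y_va = train_test_split(X, psi, test_size=0.2, random_state=0)

gb = GradientBoostingRegressor(n_estimators=400, learning_rate=0.01,
                               max_depth=6, random_state=0).fit(X_tr, y_tr)

# Validation MSE after each boosting iteration; taking the argmin mimics gbm.perf
val_err = [float(np.mean((y_va - p) ** 2)) for p in gb.staged_predict(X_va)]
best_iter = int(np.argmin(val_err)) + 1
print(1 <= best_iter <= 400)
```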

#### 4.3. LC Model Forecasting Improved by Machine Learning

#### Goodness of Forecasting

#### 4.4. Results for a Shorter Estimation Period: 1960–2014

#### Goodness of Forecasting

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A

#### Appendix A.1 Plots for Time Period 1915–2014

**Figure A1.** Values of $\Delta {m}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{ML}}$. Italian female population. Ages 0–100 and years 1915–2014.

**Figure A2.** ${\kappa}_{t}^{(1)}$ and ${\kappa}_{t}^{(1,\psi )}$: fitted values (1915–2000) and forecasted values (2000–2014).

#### Appendix A.2 Plots for Time Period 1960–2014

**Figure A3.** Values of $\Delta {m}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{ML}}$. Italian female population. Ages 0–100 and years 1960–2014.

**Figure A4.** ${\kappa}_{t}^{(1)}$ and ${\kappa}_{t}^{(1,\psi )}$: fitted values (1960–2000) and forecasted values (2000–2014).

## References

- Alpaydin, Ethem. 2010. Introduction to Machine Learning, 2nd ed. Cambridge: Massachusetts Institute of Technology Press, ISBN 026201243X. [Google Scholar]
- Breiman, Leo, Jerome Friedman, Richard Olshen, and Charles Stone. 1984. Classification and Regression Trees. Boca Raton: CRC Press, ISBN 9780412048418. [Google Scholar]
- Breiman, Leo. 2001. Random forests. Machine Learning 45: 5–32. [Google Scholar] [CrossRef]
- Brouhns, Natacha, Michel Denuit, and Jeroen K. Vermunt. 2002. A Poisson log-bilinear regression approach to the construction of projected lifetables. Insurance: Mathematics and Economics 31: 373–93. [Google Scholar] [CrossRef]
- Cairns, Andrew J. G., David Blake, Kevin Dowd, Guy D. Coughlan, David Epstein, Alen Ong, and Igor Balevich. 2009. A quantitative comparison of stochastic mortality models using data from England and Wales and the United States. North American Actuarial Journal 13: 1–35. [Google Scholar] [CrossRef]
- Deprez, Philippe, Pavel V. Shevchenko, and Mario V. Wüthrich. 2017. Machine learning techniques for mortality modeling. European Actuarial Journal 7: 337–52. [Google Scholar] [CrossRef]
- Haberman, Steven, and Arthur Renshaw. 2011. A comparative study of parametric mortality projection models. Insurance: Mathematics and Economics 48: 35–55. [Google Scholar] [CrossRef] [Green Version]
- Hainaut, Donatien. 2018. A neural-network analyzer for mortality forecast. Astin Bulletin 48: 481–508. [Google Scholar] [CrossRef]
- Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2016. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer, ISBN 0387848576. [Google Scholar]
- Hunt, Andrew, and Andrés M. Villegas. 2015. Robustness and convergence in the Lee-Carter model with cohorts. Insurance: Mathematics and Economics 64: 186–202. [Google Scholar] [CrossRef]
- James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. An Introduction to Statistical Learning: With Applications in R. New York: Springer, ISBN 1461471370. [Google Scholar]
- Lee, Ronald D., and Lawrence R. Carter. 1992. Modeling and forecasting US mortality. Journal of the American Statistical Association 87: 659–71. [Google Scholar]
- Liaw, Andy. 2018. Package `randomForest`. Available online: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf (accessed on 21 May 2018).
- Plat, Richard. 2009. On stochastic mortality modeling. Insurance: Mathematics and Economics 45: 393–404. [Google Scholar]
- Renshaw, Arthur E., and Steven Haberman. 2006. A Cohort-Based Extension to the Lee-Carter Model for Mortality Reduction Factors. Insurance: Mathematics and Economics 38: 556–70. [Google Scholar] [CrossRef]
- Richman, Ronald, and Mario V. Wüthrich. 2018. A Neural Network Extension of the Lee-Carter Model to Multiple Populations. Rochester: SSRN. [Google Scholar]
- Ridgeway, Greg. 2007. Generalized Boosted Models: A Guide to the `gbm` Package. Available online: https://cran.r-project.org/web/packages/gbm/gbm.pdf (accessed on 21 May 2018).
- Therneau, Terry M., and Elizabeth J. Atkinson. 2017. An Introduction to Recursive Partitioning Using the RPART Routines. Available online: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf (accessed on 21 May 2018).
- Villegas, Andrés M., Pietro Millossovich, and Vladimir K. Kaishev. 2015. `StMoMo`: An R Package for Stochastic Mortality Modelling. Available online: https://cran.r-project.org/web/packages/StMoMo/vignettes/StMoMoVignette.pdf (accessed on 21 May 2018).

1. The AIC and BIC statistics are both functions of the log-likelihood, $\mathcal{L}$, and the number of parameters involved in the model, $\nu$: AIC $=2\nu -2\mathcal{L}$ and BIC $=\nu \log N-2\mathcal{L}$, where $N$ is the number of observations.
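The two formulas are easy to sanity-check numerically (the values below are illustrative, not taken from Table 1):

```python
import math

def aic(loglik, nu):
    """AIC = 2*nu - 2*loglik."""
    return 2 * nu - 2 * loglik

def bic(loglik, nu, n_obs):
    """BIC = nu*log(N) - 2*loglik."""
    return nu * math.log(n_obs) - 2 * loglik

L, nu, N = -1000.0, 10, 500
print(aic(L, nu))  # 2020.0
print(round(bic(L, nu, N), 2))
```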


**Figure 1.** Heat map of standardized residuals of the mortality models. Ages 0–100 and years 1915–2014, Italian population.

**Figure 2.** Estimates of the optimal number of boosting iterations for the LC, RH, and Plat model. Black line: out-of-bag estimates; green line: cross-validation estimates.

**Figure 3.** Values of $\Delta {m}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{ML}}$. Italian male population. Ages 0–100 and years 1915–2014.

**Figure 4.** Logarithm of the mortality rate, ages 0–50. Italian male population. Examples for cohorts born in 1915, 1940, 1965, 1990, and 2014.

**Figure 5.** Values of $\Delta {m}_{\mathbf{x}}^{\mathrm{mdl},\mathrm{ML}}$. Italian male population. Ages 0–100 and years 1960–2014.

**Table 1.** Log-likelihood, AIC, and BIC statistics for the LC, RH, and Plat model. Ages 0–100 and years 1915–2014, Italian population.

| Gender: | Males | | | Females | | |
|---|---|---|---|---|---|---|
| Model: | LC | RH | Plat | LC | RH | Plat |
| $\nu$ | 300 | 499 | 496 | 300 | 499 | 496 |
| $\mathcal{L}$ | −643,226 | −307,985 | −875,549 | −176,421 | −137,098 | −281,518 |
| AIC (Rank) | 1,287,051 (2) | 616,967 (1) | 1,752,089 (3) | 353,442 (2) | 275,193 (1) | 564,028 (3) |
| BIC (Rank) | 1,289,218 (2) | 620,570 (1) | 1,755,671 (3) | 355,608 (2) | 278,796 (1) | 567,610 (3) |

**Table 2.** Explained variance and MSR by the RF algorithm for the LC, RH, and Plat model. Ages 0–100 and years 1915–2014, Italian population.

| Model | Explained Variance | MSR |
|---|---|---|
| LC | 96.25% | 0.0263 |
| RH | 86.48% | 0.0057 |
| Plat | 91.14% | 0.0058 |

**Table 3.** MAPE of fitted with respect to observed data for the LC, RH, and Plat model, before (No ML) and after (DT, RF, GB) the application of ML algorithms. Ages 0–100 and years 1915–2014, Italian population. Values in bold indicate the specification with the smallest MAPE for each model.

| Gender | Males | | | Females | | |
|---|---|---|---|---|---|---|
| Model | LC | RH | Plat | LC | RH | Plat |
| No ML | 19.48% | 18.03% | 25.81% | 13.83% | 13.96% | 22.34% |
| DT | 11.06% | 11.57% | 13.69% | 9.42% | 9.12% | 11.53% |
| RF | **4.85%** | **4.89%** | **4.79%** | **4.61%** | **4.65%** | **4.49%** |
| GB | 13.80% | 11.35% | 14.59% | 8.48% | 7.84% | 9.77% |

**Table 4.** Out-of-sample test results: RMSLE and RMSE for the LC model without and with machine learning. ML estimator ${\widehat{\psi}}_{\mathbf{x}}^{\mathrm{LC},\mathrm{ML}}$ modeled according to the LC framework. Years 2000–2014 (fitting period: 1915–2000). Values in bold indicate the model with the smallest RMSLE and RMSE.

| | RMSLE | | RMSE | |
|---|---|---|---|---|
| Model | Males | Females | Males | Females |
| LC | 0.0290 | 0.0195 | 0.7282 | 0.7734 |
| LC, DT | 0.0139 | 0.0068 | 0.3567 | 0.5351 |
| LC, RF | **0.0083** | **0.0044** | 0.3624 | **0.1532** |
| LC, GB | 0.0100 | 0.0056 | **0.3536** | 0.1841 |
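For reference, the error measures used in Tables 3 and 4 can be written compactly. A sketch with hypothetical rates (note that RMSLE has several conventions in the literature; the `log1p` form below is one common choice and may differ from the paper's exact definition):

```python
import numpy as np

def rmse(obs, pred):
    return float(np.sqrt(np.mean((np.asarray(obs) - np.asarray(pred)) ** 2)))

def rmsle(obs, pred):
    # log1p variant; other sources apply the log to the raw rates instead
    return rmse(np.log1p(obs), np.log1p(pred))

def mape(obs, pred):
    obs, pred = np.asarray(obs), np.asarray(pred)
    return float(np.mean(np.abs(obs - pred) / obs)) * 100.0

obs = np.array([0.010, 0.012, 0.015])   # hypothetical observed mortality rates
pred = np.array([0.011, 0.011, 0.014])  # hypothetical fitted/forecasted rates
print(rmse(obs, pred), mape(obs, pred))
```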

**Table 5.** MAPE of fitted with respect to observed data for the LC, RH, and Plat model, before (No ML) and after (DT, RF, GB) the application of ML algorithms. Ages 0–100 and years 1960–2014, Italian population. Values in bold indicate the specification with the smallest MAPE for each model.

| Gender | Males | | | Females | | |
|---|---|---|---|---|---|---|
| Model | LC | RH | Plat | LC | RH | Plat |
| No ML | 11.08% | 8.51% | 12.67% | 9.20% | 8.44% | 12.32% |
| DT | 5.86% | 5.27% | 6.25% | 5.90% | 5.71% | 6.62% |
| RF | **4.07%** | **4.02%** | **3.89%** | **4.95%** | **4.79%** | **4.81%** |
| GB | 6.04% | 4.84% | 6.07% | 5.49% | 5.16% | 6.27% |

**Table 6.** Out-of-sample test results: RMSLE and RMSE for the LC model without and with machine learning. Years 2000–2014 (fitting period: 1960–2000). Values in bold indicate the model with the smallest RMSLE and RMSE.

| | RMSLE | | RMSE | |
|---|---|---|---|---|
| Model | Males | Females | Males | Females |
| LC | 0.0309 | 0.0183 | 0.3298 | 0.1840 |
| LC, DT | 0.0115 | 0.0074 | 0.3212 | 0.1789 |
| LC, RF | **0.0100** | **0.0066** | 0.3095 | 0.1761 |
| LC, GB | 0.0102 | 0.0070 | **0.3028** | **0.1730** |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Levantesi, S.; Pizzorusso, V.
Application of Machine Learning to Mortality Modeling and Forecasting. *Risks* **2019**, *7*, 26.
https://doi.org/10.3390/risks7010026
