# A Comparative Analysis of Machine Learning Models for the Prediction of Insurance Uptake in Kenya

## Abstract

## 1. Introduction

## 2. Related Literature

#### 2.1. Insurance Uptake

#### 2.2. Use of Machine Learning Models

#### 2.2.1. Logistic Regression Classifier

#### 2.2.2. Support Vector Machines

#### 2.2.3. Gaussian Naive Bayes

#### 2.2.4. K Nearest Neighbor

#### 2.2.5. Decision Trees

#### 2.2.6. Random Forest

#### 2.2.7. Gradient Boosting Machine and Extreme Gradient Boosting

#### 2.2.8. Deep Learning Classifiers

## 3. Methodology

#### 3.1. Data

#### 3.2. Preprocessing and Feature Selection

#### 3.3. Handling Class Imbalance

#### 3.4. Model Performance Measures

#### 3.5. Hyperparameter Optimization

## 4. Results and Discussion

#### 4.1. Comparison on Unbalanced Data

#### 4.2. Comparison on Balanced Data

#### 4.3. Area under the Receiver Operating Characteristic Curves and Confusion Matrices

#### 4.4. Phase II Analysis: Comparison of Models on Oversampled Data

#### 4.5. Feature Importance

## 5. Conclusions and Recommendations

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

Parameter | Range | Optimal Value |
---|---|---|
n_estimators | [80 to 150, interval of 10] | 110 |
max_features | [auto, sqrt, log2] | auto |
min_samples_split | [2, 4, 6, 8] | 2 |
Bootstrap | [True, False] | True |
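The size of the search space implied by the table above can be checked with a short sketch. It assumes the ranges were searched exhaustively (a full grid), which the surviving text does not confirm; `search_space` and `expand_grid` are hypothetical names introduced only for illustration, and the code uses the standard library alone rather than any particular tuning API.

```python
from itertools import product

# Candidate values copied from the table above.
search_space = {
    "n_estimators": list(range(80, 151, 10)),  # 80 to 150 in steps of 10
    "max_features": ["auto", "sqrt", "log2"],
    "min_samples_split": [2, 4, 6, 8],
    "bootstrap": [True, False],
}


def expand_grid(space):
    """Return every combination of candidate values as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


grid = expand_grid(search_space)
n_candidates = len(grid)  # 8 * 3 * 4 * 2 combinations

# The configuration reported as optimal in the table.
reported_optimum = {
    "n_estimators": 110,
    "max_features": "auto",
    "min_samples_split": 2,
    "bootstrap": True,
}
optimum_in_grid = reported_optimum in grid
```

Under a full grid, each candidate would be fitted once per cross-validation fold, so the number of combinations gives a rough sense of the tuning cost.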

Parameter | Range | Optimal Value |
---|---|---|
n_estimators | [100, 200, 300] | 100 |
learning_rate | [0.01, 0.02, 0.05, 0.1] | 0.05 |

Parameter | Range | Optimal Value |
---|---|---|
n_estimators | [500 to 1500] | 1000 |
max_depth | [auto, sqrt, log2] | auto |
max_features | [0.2 to 1] | 0.9 |
gamma | [0.1 to 1] | 0.1 |

Parameter | Range | Optimal Value |
---|---|---|
Training epochs | [100, 200, 300] | 300 |
Batch size | [20, 40, 50, 100] | 50 |
Learning rate | [0.0005, 0.001, 0.01] | 0.001 |
Activation function | [softmax, ReLU] | ReLU |

Parameter | Range | Optimal Value |
---|---|---|
Training epochs | [100, 200, 300] | 300 |
Batch size | [20, 40, 50, 100] | 40 |
Learning rate | [0.0005, 0.001, 0.01] | 0.001 |
Activation function | [softmax, ReLU] | ReLU |

Parameter | Range | Optimal Value |
---|---|---|
Training epochs | [100, 200, 300] | 300 |
Batch size | [20, 40, 50, 100] | 50 |
Learning rate | [0.0005, 0.001, 0.01] | 0.001 |
Activation function | [softmax, ReLU] | ReLU |

Parameter | Range | Optimal Value |
---|---|---|
Training epochs | [100, 200, 300] | 300 |
Batch size | [20, 40, 50, 100] | 40 |
Learning rate | [0.0005, 0.001, 0.01] | 0.001 |
Activation function | [softmax, ReLU] | ReLU |

Number | Model | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|---|
0 | Logistic | 0.6855 | 0.5430 | 0.6060 | 0.8485 |
1 | GNB | 0.4576 | 0.7258 | 0.5613 | 0.7565 |
2 | Random Forest | 0.6301 | 0.3907 | 0.4823 | 0.8200 |
3 | DT | 0.4339 | 0.4821 | 0.4567 | 0.7538 |
4 | SVM | 0.7265 | 0.4857 | 0.5822 | 0.8504 |
5 | KNN | 0.5781 | 0.4910 | 0.5310 | 0.8138 |
6 | GBM | 0.7054 | 0.5108 | 0.5925 | 0.8492 |
7 | XGB | 0.7204 | 0.5126 | 0.5990 | 0.8527 |

Number | Model | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|---|
0 | Logistic | 0.7726 | 0.7228 | 0.7423 | 0.8442 |
1 | GNB | 0.6912 | 0.7507 | 0.7059 | 0.7712 |
2 | Random Forest | 0.7456 | 0.6748 | 0.6977 | 0.8269 |
3 | DT | 0.6474 | 0.6493 | 0.6483 | 0.7654 |
4 | SVM | 0.7704 | 0.6809 | 0.7080 | 0.8365 |
5 | KNN | 0.7495 | 0.7155 | 0.7297 | 0.8327 |
6 | GBM | 0.7720 | 0.7115 | 0.7338 | 0.8423 |
7 | XGB | 0.7630 | 0.6944 | 0.7182 | 0.8366 |

Number | Model | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|---|
0 | Logistic | 0.7775 | 0.7776 | 0.7775 | 0.7775 |
1 | GNB | 0.7440 | 0.7436 | 0.7432 | 0.7433 |
2 | Random Forest | 0.9493 | 0.9467 | 0.9462 | 0.9462 |
3 | DT | 0.9311 | 0.9250 | 0.9240 | 0.9242 |
4 | SVM | 0.8193 | 0.8192 | 0.8191 | 0.8191 |
5 | KNN | 0.8328 | 0.8250 | 0.8231 | 0.8240 |
6 | GBM | 0.7921 | 0.7922 | 0.7922 | 0.7922 |
7 | XGB | 0.7874 | 0.7874 | 0.7873 | 0.7873 |

Number | Model | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|---|
0 | Logistic | 0.8292 | 0.8314 | 0.8297 | 0.8303 |
1 | GNB | 0.8200 | 0.8216 | 0.8205 | 0.8214 |
2 | Random Forest | 0.8207 | 0.8232 | 0.8219 | 0.8214 |
3 | DT | 0.7210 | 0.7202 | 0.7205 | 0.7232 |
4 | SVM | 0.8125 | 0.8150 | 0.8121 | 0.8125 |
5 | KNN | 0.8207 | 0.8232 | 0.8209 | 0.8214 |
6 | GBM | 0.8558 | 0.8576 | 0.8564 | 0.8571 |
7 | XGB | 0.8649 | 0.8674 | 0.8655 | 0.8661 |

Number | Model | TP | TN | FP | FN |
---|---|---|---|---|---|
0 | Logistic | 164 | 154 | 40 | 51 |
1 | GNB | 159 | 191 | 45 | 14 |
2 | Random Forest | 190 | 201 | 14 | 4 |
3 | DT | 170 | 204 | 34 | 1 |
4 | SVM | 158 | 191 | 46 | 14 |
5 | KNN | 159 | 191 | 45 | 14 |
6 | GBM | 169 | 166 | 35 | 37 |
7 | XGB | 179 | 190 | 25 | 15 |
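As a reading aid for the confusion-matrix counts above, the following minimal sketch derives the usual binary-classification scores from TP, TN, FP and FN using their textbook definitions. The helper name `metrics_from_counts` is hypothetical. The surviving text does not say which class is treated as positive, which experiment these counts come from, or how the reported scores were averaged, so the values computed here need not coincide with any of the score tables.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Derive binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)                   # share of predicted positives that are correct
    recall = tp / (tp + fn)                      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Counts taken from the Random Forest row of the table above.
rf = metrics_from_counts(tp=190, tn=201, fp=14, fn=4)
```

The same function can be applied row by row to compare the models on a common footing.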

Number | Model | Precision | Recall | F1-Score | Accuracy | AUC |
---|---|---|---|---|---|---|
0 | Random Forest | 0.94175 | 0.93863 | 0.93932 | 0.93953 | 0.98620 |
1 | XGBoost | 0.88645 | 0.87960 | 0.88040 | 0.88121 | 0.92690 |
2 | MLP | 0.82927 | 0.71429 | 0.76749 | 0.77754 | 0.85210 |
3 | CNN | 0.86142 | 0.96639 | 0.91089 | 0.90281 | 0.95113 |
4 | LSTM | 0.87405 | 0.96219 | 0.91600 | 0.90929 | 0.95592 |
5 | CNN-LSTM | 0.88031 | 0.95798 | 0.91751 | 0.91145 | 0.95666 |

Rank | Feature | Importance |
---|---|---|
1 | Having a bank product | 0.191 |
2 | Wealth quintile | 0.111 |
3 | Subregion | 0.109 |
4 | Education | 0.088 |
5 | Age group | 0.068 |
6 | Most trusted provider | 0.051 |
7 | Nature of residence | 0.050 |
8 | Numeracy | 0.048 |
9 | Household size | 0.047 |
10 | Marital status | 0.041 |
11 | 2nd most trusted provider | 0.039 |
12 | Ownership of a phone | 0.038 |
13 | Having a set emergency fund | 0.033 |
14 | Having electricity as a light source | 0.031 |
15 | Gender | 0.030 |
16 | Urban vs. rural | 0.026 |
17 | Being a youth | 0.026 |
18 | Having a smartphone | 0.023 |
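One way to read the ranking above is to ask how many of the top features are needed to cover a given share of the total importance. The sketch below does this with the values exactly as listed; the function name `top_features_covering` and the 50% threshold are illustrative assumptions, not part of the source. The listed values sum to about 1.05 rather than 1, so the sketch measures shares against their own total instead of assuming they are normalized.

```python
# (feature, importance) pairs copied from the table above, in rank order.
importances = [
    ("Having a bank product", 0.191),
    ("Wealth quintile", 0.111),
    ("Subregion", 0.109),
    ("Education", 0.088),
    ("Age group", 0.068),
    ("Most trusted provider", 0.051),
    ("Nature of residence", 0.050),
    ("Numeracy", 0.048),
    ("Household size", 0.047),
    ("Marital status", 0.041),
    ("2nd most trusted provider", 0.039),
    ("Ownership of a phone", 0.038),
    ("Having a set emergency fund", 0.033),
    ("Having electricity as a light source", 0.031),
    ("Gender", 0.030),
    ("Urban vs. rural", 0.026),
    ("Being a youth", 0.026),
    ("Having a smartphone", 0.023),
]


def top_features_covering(ranked, share):
    """Return the shortest prefix of `ranked` whose importance reaches `share` of the total."""
    total = sum(value for _, value in ranked)
    running = 0.0
    selected = []
    for name, value in ranked:
        selected.append(name)
        running += value
        if running >= share * total:
            break
    return selected


# Illustrative threshold: features that together account for half of the listed importance.
top_half = top_features_covering(importances, 0.5)
```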

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yego, N.K.; Kasozi, J.; Nkurunziza, J. A Comparative Analysis of Machine Learning Models for the Prediction of Insurance Uptake in Kenya. *Data* **2021**, *6*, 116.
https://doi.org/10.3390/data6110116
