Machine Learning Applications for Predicting High-Cost Claims Using Insurance Data
Abstract
1. Introduction
- Which factors affect the classification of insurance claims?
- Which of the machine learning models performs best in classifying insurance claims?
2. Literature Review
2.1. Claim Analysis
2.2. Fraud Detection
2.3. Premium Pricing and Sales Optimization
3. Data
3.1. Exploratory Data Analysis
3.2. Data Encoding
3.3. Data Transformation
3.4. Imbalanced Dataset
3.5. Training and Testing
4. Methods
4.1. Logistic Regression
4.2. Classification and Regression Tree (CART) Method
4.3. Random Forest
4.4. XGBoost
4.5. Support Vector Machine
4.6. K-Nearest Neighbors
4.7. Naïve Bayes
4.8. Model Evaluation
4.8.1. Confusion Matrix
4.8.2. Accuracy
4.8.3. Area Under the Curve (AUC)
4.8.4. Type I and Type II Errors
- Type I error = FP/(TN + FP);
- Type II error = FN/(TP + FN).
4.8.5. Hyperparameter Tuning
5. Results
5.1. Model Performance
5.2. Confusion Matrices Results
5.3. Results on Type I and Type II Errors
5.4. Variable Importance Analysis
5.5. Robustness Evaluation of Model Performance
5.5.1. Tuning Parameters
5.5.2. Validation with 70/30 Data Split
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Category | Variable | Definition | Type | Categories/Intervals |
---|---|---|---|---|
Claim factors | Individual Claims | The monetary amount claimed, based on the type of claim. | Continuous | [5904, 15,900,000]
 | Claim Type | The nature of the claim based on disability or fatality. | Categorical (ordinal) | General->1/Temporary disability->2/Partial permanent disability->3/Total permanent disability->4/Death->5
 | Frequency | The number of claims made by the insured. | Numeric (discrete) | 1, 2, 3, 4, 5
 | Deferred Period | The waiting period (years) before claims are paid. | Numeric (discrete) | 0, 1, 2, 3, 4, 5
Vehicle factors | Vehicle Type | The classification of the insured vehicle. | Categorical (ordinal) | Motorcycles->1/Auto->2/Light trucks->3/Vans, Minibuses->4/Trucks, Buses->5
 | Vehicle Brand | The brand of the insured vehicle. | Categorical (nominal) | Mercedes Benz->19/Volkswagen->18/Ford->17/Audi->16/BMW->15/Opel->14/DaimlerChrysler->13/Fiat->12/Peugeot->11/Toyota->10/Suzuki->9/Volvo->8/Mitsubishi->7/Renault->6/Land Rover->5/Nissan->4/Iveco->3/Citroen->2/Seat->1/Others->0
 | Vehicle Age | The number of years since the vehicle was manufactured. | Continuous | [1, 43]
Insured-related factors | Driver Age | The age of the insured driver. | Continuous | [16, 86]
 | Region | The geographical area where the policyholder resides. | Categorical (nominal) | Center->1/South->2/North->3
 | Gender | The gender of the insured driver. | Categorical (nominal) | Male->1/Female->0
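The ordinal and nominal codes listed above can be applied with simple lookup vectors. The sketch below is illustrative only; the column names and the small data frame are assumptions, not the authors' dataset.

```r
# Minimal sketch of the Appendix A encodings (column names are assumed, not the authors')
claims <- data.frame(
  claim_type = c("Temporary disability", "Death", "General"),
  region     = c("Center", "South", "North"),
  gender     = c("Male", "Female", "Male")
)

claim_type_map <- c("General" = 1, "Temporary disability" = 2,
                    "Partial permanent disability" = 3,
                    "Total permanent disability" = 4, "Death" = 5)
region_map <- c("Center" = 1, "South" = 2, "North" = 3)

claims$claim_type_code <- unname(claim_type_map[claims$claim_type])
claims$region_code     <- unname(region_map[claims$region])
claims$gender_code     <- ifelse(claims$gender == "Male", 1, 0)
claims
```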
Variable | Category | High | Low | Total |
---|---|---|---|---|
Claim Type | 1 | 12 (1.5%) | 5 (0.62%) | 17 (2.12%)
 | 2 | 347 (43.38%) | 223 (27.88%) | 570 (71.25%)
 | 3 | 19 (2.38%) | 130 (16.25%) | 149 (18.62%)
 | 4 | 0 (0%) | 4 (0.5%) | 4 (0.5%)
 | 5 | 1 (0.12%) | 59 (7.38%) | 60 (7.5%)
Vehicle Type | 1 | 11 (1.38%) | 16 (2%) | 27 (3.38%)
 | 2 | 306 (38.25%) | 334 (41.75%) | 640 (80%)
 | 3 | 27 (3.38%) | 30 (3.75%) | 57 (7.12%)
 | 4 | 17 (2.12%) | 24 (3%) | 41 (5.12%)
 | 5 | 18 (2.25%) | 17 (2.12%) | 35 (4.38%)
Region | 1 | 212 (26.5%) | 222 (27.75%) | 434 (54.25%)
 | 2 | 89 (11.12%) | 98 (12.25%) | 187 (23.38%)
 | 3 | 78 (9.75%) | 101 (12.62%) | 179 (22.38%)
Gender | 0 | 80 (10%) | 84 (10.5%) | 164 (20.5%)
 | 1 | 299 (37.38%) | 337 (42.12%) | 638 (79.5%)
Frequency | 1 | 294 (36.75%) | 295 (36.88%) | 589 (73.62%)
 | 2 | 62 (7.75%) | 60 (7.5%) | 122 (15.25%)
 | 3 | 16 (2%) | 36 (4.5%) | 52 (6.5%)
 | 4 | 4 (0.5%) | 22 (2.75%) | 26 (3.25%)
 | 5 | 3 (0.38%) | 8 (1%) | 11 (1.38%)
Deferred Period | 0 | 21 (2.62%) | 9 (1.12%) | 30 (3.75%)
 | 1 | 147 (18.38%) | 90 (11.25%) | 237 (29.62%)
 | 2 | 154 (19%) | 202 (25.25%) | 354 (44.25%)
 | 3 | 39 (4.88%) | 72 (9%) | 111 (13.88%)
 | 4 | 12 (1.5%) | 37 (4.62%) | 49 (6.12%)
 | 5 | 8 (1%) | 11 (1.38%) | 19 (2.38%)
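A frequency table of this kind can be reproduced with base R's table() and prop.table(); the factors below are simulated stand-ins, not the study data.

```r
# Simulated stand-ins for the encoded variables (not the study data)
set.seed(1)
claim_class <- factor(sample(c("High", "Low"), 800, replace = TRUE))
region      <- factor(sample(1:3, 800, replace = TRUE, prob = c(0.54, 0.23, 0.23)))

counts <- table(Region = region, Class = claim_class)
counts                              # absolute frequencies per cell
round(100 * prop.table(counts), 2)  # each cell as a percentage of all 800 claims
```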
References
- Rawat, S.; Rawat, A.; Kumar, D.; Sabitha, A.S. Application of Machine Learning and Data Visualization Techniques for Decision Support in the Insurance Sector. Int. J. Inf. Manag. Data Insights 2021, 1, 100012.
- Poufinas, T.; Gogas, P.; Papadimitriou, T.; Zaganidis, E. Machine Learning in Forecasting Motor Insurance Claims. Risks 2023, 11, 164.
- Prodanov, S. Indemnification of non-material damages caused by road traffic accidents—Ethical and financial aspects. Econ. Arch. 2017, 4, 3–14. Available online: https://ideas.repec.org/a/dat/earchi/y2017i4p3-14.html (accessed on 2 April 2025).
- Wiedemann, M.; John, D. A practitioners approach to individual claims models for bodily injury claims in German non-life insurance. Z. Gesamte Versicherungswiss. 2021, 110, 225–254.
- Weerasinghe, K.P.M.L.P.; Wijegunasekara, M.C. A comparative study of data mining algorithms in the prediction of auto insurance claims. Eur. Int. J. Sci. Technol. 2016, 5, 47–54. Available online: https://eijst.org.uk/files/images/frontimages/gallery/vol._5_no._1/6._47-54.pdf (accessed on 10 March 2025).
- Hanafy, M.; Ming, R. Classification of the Insureds Using Integrated Machine Learning Algorithms: A Comparative Study. Appl. Artif. Intell. 2022, 36, 1–32.
- Alamir, E.; Urgessa, T.; Hunegnaw, A.; Gopikrishna, T. Motor Insurance Claim Status Prediction Using Machine Learning Techniques. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 3.
- Brati, E.; Braimllari, A. Review of Statistical and Machine Learning Methods Applied in Private Health Insurance. Albanian J. Econ. Bus. 2024, 41, 66–80. Available online: https://feut.edu.al/konferenca-botime/revista-shkencore/2096-albanian-journal-of-economy-and-business (accessed on 16 March 2025).
- Ellili, N.; Nobanee, H.; Alsaiari, L.; Shanti, H.; Hillebrand, B.; Hassanain, N.; Elfout, L. The Applications of Big Data in the Insurance Industry: A Bibliometric and Systematic Review of Relevant Literature. J. Finance Data Sci. 2023, 9, 100102.
- Clemente, C.; Guerreiro, G.R.; Bravo, J.M. Modelling Motor Insurance Claim Frequency and Severity Using Gradient Boosting. Risks 2023, 11, 163.
- Permai, S.D.; Herdianto, K. Prediction of Health Insurance Claims Using Logistic Regression and XGBoost Methods. Procedia Comput. Sci. 2023, 227, 1012–1019.
- Brati, E.; Braimllari, A. Application of Bootstrap and Deterministic Methods for Reserving Claims in Private Health Insurance. Int. J. Math. Trends Technol. 2023, 69, 17–26.
- Brati, E.; Braimllari, A. A Comparative Analysis of Stochastic Approaches for Claims Reserving in Private Health Insurance. WSEAS Trans. Bus. Econ. 2025, 22, 130–143.
- Orji, U.; Ukwandu, E. Machine Learning for an Explainable Cost Prediction of Medical Insurance. Mach. Learn. Appl. 2024, 15, 100516.
- Vinora, A.; Surya, V.; Lloyds, E.; Kathir Pandian, B.; Deborah, R.N.; Gobinath, A. An Efficient Health Insurance Prediction System Using Machine Learning. In Proceedings of the 2023 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 14–15 December 2023; pp. 1–5.
- Maisog, J.M.; Li, W.; Xu, Y.; Hurley, B.; Shah, H.; Lemberg, R.; Gutfraind, A. Using Massive Health Insurance Claims Data to Predict Very High-Cost Claimants: A Machine Learning Approach. arXiv 2019, arXiv:1912.13032.
- Langenberger, B.; Schulte, T.; Groene, O. The application of machine learning to predict high-cost patients: A performance comparison of different models using healthcare claims data. PLoS ONE 2023, 18, e0279540.
- Grize, Y.-L.; Fischer, W.; Lützelschwab, C. Machine Learning Applications in Nonlife Insurance. Appl. Stoch. Models Bus. Ind. 2020, 36, 523–537.
- Alomair, G. Predictive Performance of Count Regression Models Versus Machine Learning Techniques: A Comparative Analysis Using an Automobile Insurance Claims Frequency Dataset. PLoS ONE 2024, 19, e0314975.
- Hanafy, M.; Ming, R. Machine Learning Approaches for Auto Insurance Big Data. Risks 2021, 9, 42.
- Nabrawi, E.; Alanazi, A. Fraud Detection in Healthcare Insurance Claims Using Machine Learning. Risks 2023, 11, 160.
- Mavundla, K.; Thakur, S.; Adetiba, E.; Abayomi, A. Predicting Cross-Selling Health Insurance Products Using Machine-Learning Techniques. J. Comput. Inf. Syst. 2024, 1–18.
- Yego, N.K.K.; Nkurunziza, J.; Kasozi, J. Predicting Health Insurance Uptake in Kenya Using Random Forest: An Analysis of Socio-Economic and Demographic Factors. PLoS ONE 2023, 18, e0294166.
- Wang, Y. Predictive Machine Learning for Underwriting Life and Health Insurance. In Proceedings of the Actuarial Society of South Africa’s 2021 Virtual Convention, Virtual, 19–22 October 2021; Available online: https://www.actuarialsociety.org.za/convention/wp-content/uploads/2021/10/2021-ASSA-Wang-FIN-reduced.pdf (accessed on 15 January 2025).
- Taha, A.; Cosgrave, B.; Mckeever, S. Using Feature Selection with Machine Learning for Generation of Insurance Insights. Appl. Sci. 2022, 12, 3209.
- Adnan Aslam, M.; Murtaza, F.; Ehatisham Ul Haq, M.; Yasin, A.; Ali, N. SAPEx-D: A Comprehensive Dataset for Predictive Analytics in Personalized Education Using Machine Learning. Data 2025, 10, 27.
- Breskuvienė, D.; Dzemyda, G. Categorical Feature Encoding Techniques for Improved Classifier Performance When Dealing with Imbalanced Data of Fraudulent Transactions. Int. J. Comput. Commun. Control 2023, 18, 3.
- Lunardon, N.; Menardi, G.; Torelli, N. ROSE: A Package for Binary Imbalanced Learning. R J. 2014, 6, 79.
- Lee, C.-W.; Fu, M.-W.; Wang, C.-C.; Azis, M.I. Evaluating Machine Learning Algorithms for Financial Fraud Detection: Insights from Indonesia. Mathematics 2025, 13, 600.
- Dhamo, Z.; Gjeçi, A.; Zibri, A.; Prendi, X. Business Distress Prediction in Albania: An Analysis of Classification Methods. J. Risk Financ. Manag. 2025, 18, 118.
- AbdElminaam, D.S.; Farouk, M.; Shaker, N.; Elrashidy, O.; Elazab, R. An Efficient Framework for Predicting Medical Insurance Costs Using Machine Learning. J. Comput. Commun. 2024, 3, 55–64.
- Therneau, T.; Atkinson, B.; Ripley, B.; Venables, W.N.; Liaw, A.; Wiener, M.; Chen, T.; He, T.; Benesty, M.; Tang, Y.; et al. R Packages Used for Classification Modeling: Rpart, randomForest, xgboost, e1071, and Class; R Foundation for Statistical Computing: Vienna, Austria, 2023; Available online: https://cran.r-project.org (accessed on 10 November 2024).
- Liu, C.-J.; Huang, T.-S.; Ho, P.-T.; Huang, J.-C.; Hsieh, C.-T. Correction: Machine Learning-Based E-Commerce Platform Repurchase Customer Prediction Model. PLoS ONE 2024, 19, e0315518.
- Ala’raj, M.; Abbod, M.; Radi, M. The Applicability of Credit Scoring Models in Emerging Economies: An Evidence from Jordan. Int. J. Islam. Middle East Financ. Manag. 2018, 11, 608–630.
- Rajput, D.; Wang, W.J.; Chen, C.C. Evaluation of a Decided Sample Size in Machine Learning Applications. BMC Bioinform. 2023, 24, 48.
- Vabalas, A.; Gowen, E.; Poliakoff, E.; Casson, A.J. Machine Learning Algorithm Validation with a Limited Sample Size. PLoS ONE 2019, 14, e0224365.
Study | Purpose | Algorithms | Dataset | Performance Metrics | Key Findings |
---|---|---|---|---|---|
[16] | High-cost claim prediction | LightGBM, XGBoost | U.S. health insurance data with 48 million observations (2017–2019) | Accuracy, Precision, F1-score, Recall | LightGBM best performer; key predictors: age, rising cost, life expectancy
[6] | Insurance claim classification | Random Forest, Decision Tree, K-NN, Logistic Regression | Three auto insurance datasets from the Kaggle repository | AUC | RF achieved the highest classification performance
[19] | Auto claim frequency prediction | SVM; Poisson, Negative Binomial, and Zero-Inflated Poisson regressions | Automobile insurance data from SAS Enterprise Miner (10,303 observations, 33 variables) | Mean Absolute Error (MAE) | SVM outperformed the others; improved risk pricing and estimation
[20] | Predicting auto insurance claim occurrence | Logistic Reg., XGBoost, RF, DT, NB, K-NN | Brazilian automotive data from the Kaggle repository (1.48 M observations, 59 features) | Accuracy, AUC | RF achieved the best results among the tested models
[10] | Modeling motor insurance claim frequency and severity | Gradient Boosting (GB), Generalized Linear Models (GLMs) | European auto insurer (2,464,181 observations, 21 features, 2016–2019) | Friedman’s H-statistic | GB better on frequency, GLM better on severity; ML enables better risk management
[1] | Fraud detection | RF, DT, SVM, K-NN, Logistic Reg., Gaussian Naïve Bayes, Bernoulli Naïve Bayes, Mixed Naïve Bayes | Kaggle dataset (1338 records, 9 variables) | Accuracy, Precision, Recall, F1-score, AUC | RF most efficient for fraud detection
[21] | Health fraud detection | Random Forest, Logistic Regression, ANN | Saudi Arabian health insurance data (396 observations, January–May 2022) | Accuracy, Precision, Recall, F1-score | RF outperformed the other classifiers; policy type, education, and age were the most significant features
[11] | Insurance fraud prediction | XGBoost, Logistic Regression | Indonesian private insurer data (11,882 observations, 19 features) | Accuracy, Precision, Recall | XGBoost outperformed Logistic Regression
[18] | Application of ML in non-life insurance, focusing on premium pricing | XGBoost, Decision Tree, Neural Network | Data from 20 competitors over a 12-month period (about 30,000 observations, 70 features) | Mean Absolute Percentage Error (MAPE) | XGBoost had the best performance in premium prediction
[22] | Cross-selling behavior for health insurance | RF, K-NN, XGBoost, Logistic Regression | South African health insurance dataset (1 M records, 16 features) | Accuracy, Precision, Recall, F1-score, Support | RF identified strong predictors and achieved high predictive accuracy
[23] | Predicting health insurance uptake behavior | Random Forest, XGBoost, Logistic Regression | Kenya FinAccess Survey data, 2021 (22,024 records, 23 features) | Accuracy, Recall, F1-score, AUC | RF outperformed the other models in prediction accuracy
[24] | Underwriting optimization | XGBoost, Random Forest, Bagging, K-NN, Gradient Boosting, SVM, Decision Tree, AdaBoost, Logistic Regression | Reinsurer’s life/health data, January 2017–June 2020 (29,317 observations, 37 variables) | Accuracy, Precision, Recall, F1-score | XGBoost outperformed the other methods for underwriting
[25] | Feature selection and noise reduction for improved model performance | SVM and K-NN for classification tasks | Five public insurance datasets from the Kaggle machine learning repository | Accuracy | Feature selection improved model accuracy while adding interpretability and strategic value for insurers
Variable | Mean | Standard Deviation | Min | Max |
---|---|---|---|---|
Claims | 1,319,104.8 | 1,964,988.5 | 5904.48 | 15,900,000 |
Frequency | 1.44 | 0.86 | 1 | 5 |
Driver Age | 43.32 | 14.25 | 16 | 86 |
Vehicle Age | 18.66 | 6.44 | 1 | 43 |
Deferred Period | 1.96 | 1.02 | 0 | 5 |
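Descriptive statistics of this form can be obtained column-wise in a single call; the data frame below is simulated and its column names are assumptions, not the authors' data.

```r
# Simulated numeric columns mirroring the variables summarised above (not the study data)
set.seed(7)
df <- data.frame(
  claims          = rlnorm(800, meanlog = 13.5, sdlog = 1.2),
  frequency       = sample(1:5, 800, replace = TRUE),
  driver_age      = sample(16:86, 800, replace = TRUE),
  vehicle_age     = sample(1:43, 800, replace = TRUE),
  deferred_period = sample(0:5, 800, replace = TRUE)
)
t(sapply(df, function(x) c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))))
```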
 | Predicted (Low) | Predicted (High) |
---|---|---|
Actual (Low) | True Negative (TN) | False Positive (FP) (Type I error) |
Actual (High) | False Negative (FN) (Type II error) | True Positive (TP) |
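The Type I and Type II error rates defined in Section 4.8.4 follow directly from these four cells. A minimal sketch with hypothetical labels (not the study data):

```r
# Hypothetical labels: 1 = high-cost claim, 0 = low-cost claim
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 1)
predicted <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 1)

# Confusion matrix: rows = actual, columns = predicted
cm <- table(Actual = actual, Predicted = predicted)
TN <- cm["0", "0"]; FP <- cm["0", "1"]
FN <- cm["1", "0"]; TP <- cm["1", "1"]

type_I  <- FP / (TN + FP)   # low-cost claims misclassified as high-cost
type_II <- FN / (TP + FN)   # high-cost claims misclassified as low-cost
c(TypeI = type_I, TypeII = type_II, Accuracy = (TP + TN) / sum(cm))
```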
ML Model | CA | AUC |
---|---|---|
Random Forest | 0.8867 | 0.9437 |
XGBoost | 0.84 | 0.9179 |
SVM | 0.7467 | 0.8416 |
Naïve Bayes | 0.7105 | 0.8703 |
Logistic Regression | 0.7599 | 0.7978 |
Decision Tree | 0.7467 | 0.8258 |
K-Nearest Neighbors | 0.8618 | 0.8621 |
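Classification accuracy (CA) and AUC values of this kind can be computed from held-out predictions, for example with the pROC package; the prediction vectors below are hypothetical, not the study results.

```r
# Hypothetical held-out predictions; pROC is assumed to be installed
library(pROC)

actual     <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))
pred_prob  <- c(0.91, 0.15, 0.40, 0.80, 0.65, 0.10, 0.77, 0.30, 0.88, 0.22)
pred_class <- factor(ifelse(pred_prob >= 0.5, 1, 0), levels = c(0, 1))

ca        <- mean(pred_class == actual)                    # classification accuracy (CA)
roc_curve <- roc(response = actual, predictor = pred_prob) # ROC curve from probabilities
c(CA = ca, AUC = as.numeric(auc(roc_curve)))
```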
ML Model | Type I Error | Type II Error |
---|---|---|
Random Forest | 8.57% | 13.75% |
XGBoost | 16% | 16% |
SVM | 27.16% | 23.19% |
Naïve Bayes | 35.51% | 13.33% |
Logistic Regression | 28.38% | 28.95% |
Decision Tree | 22.39% | 27.71% |
K-Nearest Neighbors | 15.38% | 12.16% |
Model | Parameters | Range | Optimal Value |
---|---|---|---|
Random Forest | | [3, 5, 9] | 10
XGBoost | | [0.01, 0.1, 0.2]; [3, 6, 9]; [0.5, 0.7, 1]; [0.5, 0.7, 1]; [100, 200]; [0 to 1] | 0.2; 9; 0.7; 0.7; 100; 0
Decision Tree | | [0 to 0.1] | 0.004
SVM | | [Linear, Radial, Polynomial]; [0.1, 1, 10, 100]; [0.001, 0.01, 0.1, 1] | Polynomial; 1; 0.1
K-NN | | [1 to 10] | 1
Naïve Bayes | | [0 to 1]; [0 to 1]; [FALSE, TRUE] | 0; 0.5; TRUE
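The parameter names in this table did not survive extraction, so the grid below is only a plausible reading of the XGBoost ranges (assumed to be eta, max_depth, subsample, colsample_bytree, nrounds, and gamma) expressed as a caret grid search; it is a sketch under those assumptions, not the authors' exact tuning setup.

```r
# Sketch of a caret grid search over the XGBoost ranges above; the parameter names
# are an assumption about which settings the table's ranges refer to.
library(caret)

grid <- expand.grid(
  eta              = c(0.01, 0.1, 0.2),
  max_depth        = c(3, 6, 9),
  subsample        = c(0.5, 0.7, 1),
  colsample_bytree = c(0.5, 0.7, 1),
  nrounds          = c(100, 200),
  gamma            = c(0, 0.5, 1),
  min_child_weight = 1   # required by caret's xgbTree grid; fixed here, not from the table
)

ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# `train_df` is a hypothetical training set with a two-level factor target `class`
# fit <- train(class ~ ., data = train_df, method = "xgbTree",
#              trControl = ctrl, tuneGrid = grid, metric = "ROC")
# fit$bestTune   # best combination found by cross-validation
```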
ML Model | CA | AUC |
---|---|---|
Random Forest | 0.8451 | 0.9163 |
XGBoost | 0.8584 | 0.8935 |
SVM | 0.7434 | 0.8189 |
Naïve Bayes | 0.7281 | 0.8651 |
Logistic Regression | 0.7599 | 0.7982 |
Decision Tree | 0.7389 | 0.7882 |
K-Nearest Neighbors | 0.8158 | 0.8158
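A stratified 70/30 split of the kind used for this robustness check can be produced with caret::createDataPartition; the data frame and column names below are assumptions, not the study data.

```r
# Hypothetical encoded claims data standing in for the study dataset
library(caret)
set.seed(2025)
claims_df <- data.frame(
  class      = factor(sample(c("High", "Low"), 800, replace = TRUE)),
  driver_age = sample(16:86, 800, replace = TRUE),
  frequency  = sample(1:5, 800, replace = TRUE)
)

idx      <- createDataPartition(claims_df$class, p = 0.70, list = FALSE)
train_df <- claims_df[idx, ]
test_df  <- claims_df[-idx, ]
prop.table(table(train_df$class))   # the split preserves the class proportions
```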