Machine Learning Applications for Predicting High-Cost Claims Using Insurance Data

Brati, Esmeralda; Braimllari, Alma; Gjeçi, Ardit

doi:10.3390/data10060090


by Esmeralda Brati ¹,*, Alma Braimllari ¹ and Ardit Gjeçi ²

¹ Department of Statistics and Applied Informatics, Faculty of Economy, University of Tirana, 1010 Tirana, Albania
² Department of Economics and Finance, University of New York Tirana, 1000 Tirana, Albania
* Author to whom correspondence should be addressed.

Data 2025, 10(6), 90; https://doi.org/10.3390/data10060090

Submission received: 25 April 2025 / Revised: 2 June 2025 / Accepted: 14 June 2025 / Published: 17 June 2025

(This article belongs to the Topic Applications of Algorithms in Risk Assessment and Evaluation)


Abstract

Insurance is essential for financial risk protection, but claim management is complex and requires accurate classification and forecasting strategies. This study empirically evaluates the performance of classification algorithms, including Logistic Regression, Decision Tree, Random Forest, XGBoost, K-Nearest Neighbors, Support Vector Machine, and Naïve Bayes, in predicting high-cost insurance claims. The research analyzes the claim, vehicle, and insured-party variables that influence the classification of high-cost claims. The investigation uses a dataset of 802 observations of bodily injury claims from the motor liability portfolio of a private insurance company in Albania, covering the period from 2018 to 2024. To evaluate and compare the models, we employed classification accuracy (CA), area under the curve (AUC), the confusion matrix, and error rates. Random Forest performed best, achieving the highest classification accuracy (CA = 0.8867) and AUC (0.9437) with the lowest error rates, followed by the XGBoost model, while Logistic Regression demonstrated the weakest performance. Key predictive factors in high-cost claim classification include claim type, deferred period, vehicle brand, and age of driver. These findings highlight the potential of machine learning models to improve claim classification and risk assessment and to refine underwriting policy.

Keywords:

insurance claim; classification; machine learning algorithms; variable importance; confusion matrix

1. Introduction

Insurance operates as a financial protection against potential risks through an agreement between insurers and their customers. Insurance plays a crucial role in the protection of individuals, along with businesses, and makes a significant contribution to economic stability [1]. A claim event is an incident that prompts the insured to file a claim. Insurance companies need sufficient funds to pay all existing policies and future claims from active policies [2]. Damages from road traffic accidents involving bodily injury are awarded according to both national and international regulations, such as the Green Card and the European Union (EU) Motor Insurance Directives.

Under compulsory motor third-party liability insurance, insurers cover the costs of bodily injuries or deaths, as well as other forms of expected damages, such as pain, suffering, and emotional distress [3]. Bodily injury claims need to be modeled carefully in motor liability insurance, since these claims usually represent the largest share of claims provisions and significantly influence insurers’ financial reserves. Creating individual claims models helps improve the accuracy of analysis by providing actuaries with an in-depth understanding of the claims portfolio and helping to manage risks [4]. Early identification of the risk factors that lead to high-cost claims is necessary for improving profits and keeping premiums stable in the insurance industry. Claims can differ greatly in severity when reported; therefore, predictive modeling through classification techniques is practical for estimating claims and supporting decision-making in claims management [5].

For insurance companies, it is important to classify policyholders according to risk in order to set fair premiums, since each client carries a specific level of risk. Classifying risks into categories is both important and difficult, because it aims to ensure that insurers make a profit and that prices are fair for groups with similar chances of similar future losses [6].

Rapid data growth from technology has made traditional handling methods inadequate. The industry has transformed through big data and technology, enabling companies to process vast amounts of data more efficiently. In particular, machine learning (ML) methods have shown potential in identifying how variables at the individual level relate to each other and can produce accurate and detailed predictions [7]. The continuous regulatory modifications and market dynamics of insurance prompt insurance companies to implement machine learning methods to enhance their operational performance. Recent research shows an increasing application of machine learning methodologies in the insurance industry, especially in auto and health insurance [8,9,10,11].

This study adopts a novel approach by using individual-level data to improve early risk detection, in contrast to the widely used traditional actuarial approaches such as the deterministic and stochastic methods for aggregate claim reserving [12,13]. In this way, we contribute to the growing field of individualized risk modeling, which plays a critical role in modern insurance analytics. In this study, ML techniques are applied to classify different bodily injury claims into various risk levels based on claim attributes and policyholders’ information. The main purpose is to improve how insurance companies price and underwrite policies through advanced techniques. Some studies confirm that ML algorithms produce reliable predictions in high-cost claims in insurance, enabling insurers to better anticipate and manage financial risks [14,15,16,17].

Despite the increase in studies on machine learning applications in insurance, there remains a significant gap regarding analyses focused on specific regional markets.

This study is based on data obtained from the database of an Albanian insurance company covering the period from 2018 to 2024. The dataset consists of 802 records of paid bodily injury claims from the motor liability portfolio and nine related features. To the best of our knowledge, this study is the first to apply machine learning classification models to insurance claims data from the Albanian market.

A key contribution of this research lies in its exclusive focus on bodily injury claims within the motor liability portfolio, utilizing a unique dataset from a private insurer. In contrast to previous studies that concentrate on total motor insurance losses, this analysis is limited to bodily injury claims. These claims are associated with higher costs in medical treatment, long-term care, and compensation for pain and suffering; they are generally considered more complex and have a greater financial impact [4]. As a result, they are treated as a high-risk category in both underwriting and reserving. Focusing exclusively on bodily injury claims allows us to build a more accurate model and provide detailed insights into risk classification and decision-making in motor liability insurance.

Furthermore, implementing machine learning methods for insurance claim classification contributes to the existing literature in several ways. Firstly, this study increases the understanding of risk assessment and claim management in the insurance sector by identifying the key factors affecting the prediction of high-value claims. Secondly, it advances the use of machine learning techniques in the insurance sector by demonstrating how ML algorithms improve claim classification accuracy, addressing the limitations of traditional actuarial methods. Finally, this study evaluates and compares multiple classification methods, including Random Forest (RF), Extreme Gradient Boosting (XGBoost), Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbors (K-NN), Naïve Bayes (NB), and Logistic Regression (LR), contributing to methodological advancements in insurance analytics. The proposed classification framework for insurance claims is designed to answer the following research questions:

Which factors affect the classification of insurance claims?
Which of the machine learning models performs best in classifying insurance claims?

The structure of the paper is organized as follows: Section 2 reviews recent studies on the application of machine learning algorithms in the insurance sector. Section 3 provides an overview of the data preprocessing methods and the dataset used in this study. In Section 4, the various machine learning methods employed in this analysis are described, along with the metrics that are used for evaluating the classification models. Section 5 presents the main findings, including the performance of the proposed models, and highlights feature importance. The discussions of findings and conclusions with recommendations for future research are presented in Section 6 and Section 7.

2. Literature Review

The recent literature highlights the importance of applying machine learning models in insurance, particularly in claims analysis, fraud detection, risk assessment, and premium optimization. These studies are mostly focused on optimizing algorithms and selecting key features to improve efficiency and reduce claim processing costs in auto and health insurance [8,18].

2.1. Claim Analysis

Several studies have used diverse datasets and machine learning models to enhance predictive accuracy in claims processing and analysis. For example, ref. [16] built a model to predict high-cost claims using U.S. health insurance claims data from 2017 to 2019. LightGBM (Light Gradient Boosting Machine) was the best performer, with 91.2% accuracy, and the predictive factors included age, rising costs, and life expectancy. Furthermore, ref. [6] focused on optimizing claims analysis in the insurance industry through ML algorithms, using a dataset obtained from Kaggle.com, and concluded that random forests achieved successful results in insurance classification tasks. The authors of [19] used machine learning to improve the accuracy of automobile insurance claim frequency forecasts, working with a dataset containing extreme values from the SAS Enterprise Miner database (version 15.1). The SVM model outperformed the other models and provided important information for risk estimation and pricing models. In [20], the authors used a dataset from a large Brazilian automotive company, consisting of 1,488,028 observations and 59 variables, to analyze machine learning applications in auto insurance claims processing. They compared several ML models, including Logistic Regression, XGBoost, Random Forest, Decision Trees, Naïve Bayes, and K-NN, to predict the occurrence of claims, and found that the random forest model produced the best empirical results, with an accuracy of 86.77% and an area under the curve (AUC) of 0.840. Moreover, ref. [10] used data from a European insurance firm covering 1 January 2016 to 31 December 2019 to model auto insurance claims with gradient boosting (GB) and generalized linear models (GLMs). The study shows that the GB model performs better than the GLM for claim frequency, whereas the GLM is more accurate for claim severity. Their findings demonstrate that adopting ML models can strengthen insurers’ risk management, because these models capture intricate nonlinear relationships and support more advanced data-driven decisions.

2.2. Fraud Detection

Various studies have investigated fraud detection in insurance using machine learning algorithms. The study [1] applied ML models to detect fraud in a dataset obtained from Kaggle.com, consisting of 1338 records and 9 variables. The authors used exploratory data analysis and feature selection to enhance claims analysis and concluded that the random forest model was the most efficient algorithm for detecting fraud, with an accuracy of 0.99. Similarly, ref. [21] used Random Forest, Logistic Regression, and artificial neural networks to detect healthcare insurance fraud in Saudi Arabia. Their results showed that random forest outperformed the other classifiers in fraud detection, with an accuracy of 98.21%. In addition, ref. [11] studied predictive models for fraud detection using data from a private insurance company in Indonesia, where XGBoost outperformed logistic regression in classification accuracy.

2.3. Premium Pricing and Sales Optimization

In [18], the authors employed ML algorithms to predict insurance premiums, using data obtained from more than 20 insurance firms over a 12-month period. They concluded that XGBoost outperformed Decision Tree and Neural Networks (NN) in insurance premium prediction, thereby confirming the relevance of machine learning in the industry. In addition, ref. [22] employed ML models such as RF, K-NN, XGBoost, and Logistic Regression to analyze health insurance cross-selling behavior among South African consumers using a dataset of 1,000,000 records. The authors concluded that Random Forest achieved 0.99 accuracy and an F1 score of 1.000, and identified key predictors such as age, prior insurance, and longer service history. The authors of [23] investigated factors influencing the uptake of health insurance in Kenya using data from the 2021 FinAccess Survey. Machine learning techniques performed well, especially the Random Forest model, with an accuracy of 96.36%. On the other hand, ref. [24] employed machine learning to enhance underwriting decisions for life and health insurance, especially for high insured sums, using a reinsurer’s dataset from January 2017 to June 2020. The models in that study included extreme gradient boosting, which reached 94% accuracy on the training set and 71% on the testing set, outperforming traditional methods. The study [25] proposed a feature selection framework to reduce the effects of noise in machine learning applications in insurance, based on five publicly available real insurance datasets. The study improved clustering and classification outcomes by excluding features with low interclass variance and suggested that using less than 20% or 50% of the features still improved or maintained model performance compared with using the full set of features. Table 1 provides a structured summary of the studies reviewed in the literature.

3. Data

The data used in this study were collected from bodily injury claims within the motor liability portfolio of a private insurance company in Albania. The dataset consists of 802 individual observations, including claim, vehicle, and insured information, covering the period 2018–2024. The dataset includes only reported and paid claims; insurance policies without any claims have been excluded so that the analysis focuses on the characteristics of actual claim events. The dependent variable, claims, refers to the monetary amount of each claim at the time it was first filed. This variable shows the cost per claim and does not represent the total amount paid for several claims over time. This study includes variables derived from claims and policy records, which are known in the literature as relevant factors for assessing risk and predicting the severity and classification of insurance claims. The information on all variables used in this study, including their definitions, coding, types, and labels or intervals, is provided in detail in Table A1 in Appendix A.

3.1. Exploratory Data Analysis

Exploratory data analysis (EDA) is one of the initial steps in the machine learning lifecycle and helps in understanding and managing the dataset. EDA in this study consists of an assessment of the variable types, descriptive analysis, and distribution analysis. Table A1 in Appendix A provides an overview of the structure of the extracted dataset. The target variable in this study is the individual claim amount. The nine independent variables are claim type, frequency, deferred period, vehicle type, vehicle brand, vehicle age, driver age, region, and gender. The claim amounts are right-skewed, so a natural logarithm transformation was applied to improve modeling accuracy. The k-means clustering technique was applied to classify ln(claims) into two classes: values above ln(600,000) ≈ 13.3 were classified as high claims and values below it as low claims. The mean, standard deviation, minimum, and maximum values for the numeric variables included in the analysis are presented in Table 2.
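
The labeling rule described above can be sketched as follows. This is a minimal illustration, not the authors' R code: the claim amounts are hypothetical, the function name is an assumption, and a claim exactly at the threshold is treated as low, which the text does not specify. The k-means step that produced the threshold is not reproduced here.

```python
import math

# Threshold reported in the text: ln(600,000) separates high from low claims.
THRESHOLD = 600_000

def label_claim(amount: float) -> int:
    """Return 1 (high claim) if ln(amount) > ln(THRESHOLD), else 0 (low claim)."""
    return 1 if math.log(amount) > math.log(THRESHOLD) else 0

# Hypothetical claim amounts, for illustration only.
claims = [45_000, 250_000, 600_000, 1_200_000, 5_000_000]
labels = [label_claim(c) for c in claims]

print(round(math.log(THRESHOLD), 1))  # 13.3
print(labels)                         # [0, 0, 0, 1, 1]
```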

3.2. Data Encoding

Numerical data are a prerequisite for several machine learning algorithms. Therefore, feature encoding methods must be applied. There are two main types of feature encoding: nominal for unordered data and ordinal for ordered data. Ordinal attributes have a meaningful order or ranking among the categories, while nominal attributes do not. In this study, we had both ordinal and nominal attributes; thus, we applied several encoding methods to convert the attributes into numeric values suitable for analysis [26]. Our target variable, ln(claims), after separation with k-means clustering, was a binary variable with two classes: 1 for a high claim and 0 for a low claim. We applied ordinal encoding to the ordinal attributes, such as claim type and vehicle type, which assigns a unique integer to each category according to its rank or order [27]. For example, for the claim type variable, we encoded “General” as 1, “Temporary disability” as 2, “Partial permanent disability” as 3, “Total permanent disability” as 4, and “Death” as 5. Similarly, for the vehicle type variable, we encoded “Motorcycles” as 1, “Auto” as 2, “Light trucks” as 3, “Vans, Minibuses” as 4, and “Trucks and Buses” as 5. This approach preserves the order of the ordinal attributes in the analysis. Nominal variables without a specific order were label encoded. For example, we encoded the vehicle brand variable using label encoding, assigning a numeric value to each brand, as shown in Table A2 in Appendix A. One-hot encoding was used to transform the region variable, while dummy coding was used for the gender variable.
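
A minimal sketch of these encodings is given below. The claim type mapping follows the coding stated above; the region names and the choice of reference level for gender are hypothetical placeholders, since the text does not list them.

```python
# Ordinal encoding for claim type, using the ranks given in the text.
CLAIM_TYPE_ORDER = {
    "General": 1,
    "Temporary disability": 2,
    "Partial permanent disability": 3,
    "Total permanent disability": 4,
    "Death": 5,
}

# Hypothetical region levels; the real levels are not listed in the text.
REGIONS = ["Region A", "Region B", "Region C"]

def one_hot(value, categories):
    """Return one 0/1 indicator per category (one-hot encoding)."""
    return [1 if value == c else 0 for c in categories]

def gender_dummy(gender):
    """Dummy coding with 'Female' as the reference level (an assumption)."""
    return 1 if gender == "Male" else 0

record = {"claim_type": "Death", "region": "Region B", "gender": "Male"}
encoded = (
    [CLAIM_TYPE_ORDER[record["claim_type"]]]
    + one_hot(record["region"], REGIONS)
    + [gender_dummy(record["gender"])]
)
print(encoded)  # [5, 0, 1, 0, 1]
```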

3.3. Data Transformation

An important aspect of feature preprocessing is normalization, which scales the numerical values uniformly between 0 and 1. This makes all features contribute equally to the model and prevents some features from dominating, which could reduce model performance. Normalization uses a standard formula to adjust feature values and make them more suitable for machine learning [6]. The formula used for normalization is the following:

X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \qquad (1)

The continuous variables in our dataset are the driver’s age and the vehicle’s age, which are normalized using Formula (1). This method was chosen in particular because the driver age variable did not follow a normal distribution. These transformation and normalization steps make the dataset more appropriate for machine learning algorithms.
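
Formula (1) can be illustrated with a short sketch. The driver ages are hypothetical, and the handling of a constant column (all values equal) is an assumption added to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min), as in Formula (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula is undefined, so return zeros (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical driver ages, for illustration only.
driver_ages = [18, 30, 45, 60, 78]
normalized = min_max_normalize(driver_ages)
print(normalized)  # [0.0, 0.2, 0.45, 0.7, 1.0]
```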

3.4. Imbalanced Dataset

Balancing the class distribution is crucial for predictive modeling. Some approaches, like SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples), try to mitigate bias by synthesizing additional samples, which helps improve the prediction and robustness of the models. Balanced samples are created through random oversampling of minority instances, undersampling of majority instances, or a mix of both [23,28]. For instance, in [6], the author applied the ROSE technique to unbalanced data to enhance the model’s performance. Class imbalance poses a significant challenge for machine learning algorithms, as they do not operate optimally with imbalanced data. Therefore, in this study, both over-sampling and under-sampling were employed to tackle this problem. After applying the ROSE technique, we obtained a balanced sample with 379 observations in each of the high and low claim classes.
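
The idea of combining over- and under-sampling can be sketched as below. This is plain random resampling, not the ROSE procedure itself, and the original class counts (600 low, 202 high) are hypothetical; only the total of 802 records and the target of 379 per class come from the text.

```python
import random
from collections import Counter

def balance_both(rows, labels, per_class, seed=0):
    """Resample every class to per_class records.

    Larger classes are undersampled without replacement; smaller classes keep
    all their records and are topped up by sampling with replacement.
    """
    rng = random.Random(seed)
    balanced = []
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        if len(members) >= per_class:
            chosen = rng.sample(members, per_class)
        else:
            extra = rng.choices(members, k=per_class - len(members))
            chosen = members + extra
        balanced.extend((r, cls) for r in chosen)
    rng.shuffle(balanced)
    return balanced

# Hypothetical class split of the 802 records (the real split is not stated).
rows = list(range(802))
labels = [0] * 600 + [1] * 202
balanced = balance_both(rows, labels, per_class=379)
counts = Counter(y for _, y in balanced)
print(counts[0], counts[1])  # 379 379
```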

3.5. Training and Testing

Training consists of building machine learning models on a portion of the data so that the models learn to recognize patterns and can make predictions on the remaining data. Testing evaluates how well a model performs on a subset of data that was not used in training, in order to assess its generalization capability. The classifiers were built using the selected features obtained from the feature selection methods [23]. The dataset was divided into two parts to determine the models’ performance, with 80% used for training and 20% for testing.
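
A minimal sketch of a random 80/20 split is shown below, assuming 758 balanced records (2 × 379). The function name and seed are illustrative; the exact counts produced by a given partitioning routine can differ by a few records depending on rounding or stratification.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle the records and hold out test_fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Record ids stand in for the balanced dataset (assumed size: 758).
train, test = train_test_split(range(758))
print(len(train), len(test))  # 606 152
```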

4. Methods

Insurance claims are classified into high and low categories using a binary classification framework, in order to identify those with a significant impact on insurance reserves. This approach helps insurers improve resource allocation and risk management. Regression models are better suited to continuous variables, such as the claim amount, but this research focuses on classification models to address challenges in claims management and to identify risks early. This study investigated the effectiveness of various classifiers in categorizing claims as high or low. A range of classifiers was employed to evaluate how different variables influence high claim predictions, including Logistic Regression, Decision Trees, Random Forests, XGBoost, K-Nearest Neighbors (K-NN), Naïve Bayes, and Support Vector Machines (SVMs).

4.1. Logistic Regression

Logistic regression is a popular model due to its simplicity, explainability, and low computational requirements. It is primarily used for binary classification, explaining the relationship between a binary dependent variable and multiple independent variables [29]. Logistic regression achieves high accuracy in various applications, such as in financial distress business classification [30] and in the prediction of claim occurrence [20]. Two main disadvantages of the method are its reliance on predefined relationships among the variables and its struggles with missing data inputs. Some advantages include the ability to evaluate how well the model describes the data, assess the importance of explanatory variables, and assess the impact of each observation on the outcome [23].

4.2. Classification and Regression Tree (CART) Method

A Decision Tree is a supervised learning algorithm that handles both classification and regression problems. The algorithm represents the data as a hierarchical structure of nodes and branches that is easy to understand and interpret. The model divides the data through repeated partitions on the key features that provide maximum information [6]. A Decision Tree consists of decision nodes, which determine choices, and leaf nodes, which display outcomes, forming a flowchart-like pattern. Input attributes correspond to internal nodes, while branches represent potential attribute values that lead to predictive leaf nodes. In a classification task, Decision Trees use dataset features to generate outputs such as “Yes” or “No”. In a regression task, the prediction for a leaf node is the average of the target values of the data points that fall into that leaf [30].
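
To make the partitioning step concrete, the sketch below searches for the threshold on a single numeric feature that minimizes the weighted Gini impurity of the two child nodes. Gini impurity is a common CART splitting criterion; the text does not state which criterion was used, so this choice and the toy data are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(xs, ys):
    """Return (threshold, weighted impurity) of the best split x <= threshold."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right child empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Toy data: an encoded feature and the high/low claim label (hypothetical).
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
print(best_threshold(xs, ys))  # (3, 0.0)
```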

4.3. Random Forest

The Random Forest (RF) model is an ensemble machine learning technique designed to improve predictive accuracy and robustness by combining the outputs of multiple Decision Trees. The training process generates several Decision Trees from bootstrapped samples, which are drawn at random from the original dataset with replacement. In classification tasks, Random Forest takes the class prediction from the majority vote of the individual trees, while for regression tasks, it computes the mean prediction [21]. The method is particularly effective for datasets with many features and for data with missing values or complex nonlinear patterns. The Random Forest algorithm rests on two main principles: bagging and the random subspace method. Bagging uses bootstrap sampling to produce multiple resampled datasets, each of which is used to train a tree, yielding diverse trees that reduce the variance of the prediction model. The algorithm extends this diversity by randomly sampling a subset of features at each decision tree split, which reduces overfitting and the correlation between trees [23].
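
The three ingredients named above (bootstrap sampling, random feature subsets, majority voting) can be sketched in isolation, without fitting actual trees. The feature names follow the nine predictors listed in Section 3.1; the value of 3 candidate features per split and the record ids are illustrative assumptions.

```python
import random
from collections import Counter

FEATURES = ["claim_type", "frequency", "deferred_period", "vehicle_type",
            "vehicle_brand", "vehicle_age", "driver_age", "region", "gender"]

def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, mtry, rng):
    """Pick mtry candidate features without replacement for one split."""
    return rng.sample(features, mtry)

def majority_vote(tree_predictions):
    """Combine the class predictions of the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(1)
rows = list(range(10))                            # hypothetical record ids
sample = bootstrap_sample(rows, rng)              # may contain repeated ids
subset = random_feature_subset(FEATURES, 3, rng)  # 3 candidate features
print(majority_vote([1, 0, 1, 1, 0]))             # 1
```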

4.4. XGBoost

XGBoost is a supervised machine learning algorithm that constructs several decision trees using boosting to perform classification tasks. XGBoost makes an initial prediction for all samples and then builds decision trees on the residual values. New trees are added to minimize the prediction error by finding the partitions that maximize the gain, which compares the scores of the child nodes with the score of their parent node [20]. The model improves its prediction accuracy at each iteration as its probability values are adjusted. The hyperparameters learning rate (shrinkage), gamma (minimum loss reduction), maximum tree depth, column sample (fraction of features evaluated), and subsample (data sampling rate) allow XGBoost to control tree complexity. Variable importance is derived from how frequently each variable is used in the trees' splits. XGBoost often achieves superior classification results because users can control both model complexity and the importance assigned to features [21].
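
For reference, the gain of a candidate split in the standard XGBoost formulation can be written as follows. The symbols are not defined in the text and are introduced here as assumptions: G_L, G_R are the sums of first-order loss gradients in the left and right child nodes, H_L, H_R are the corresponding sums of second-order gradients, \lambda is the L2 regularization weight, and \gamma is the minimum loss reduction mentioned above.

```latex
\mathrm{Gain} = \frac{1}{2}\left[
    \frac{G_L^{2}}{H_L + \lambda}
  + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
\right] - \gamma
```

The first two terms are the scores of the child nodes and the third is the score of the parent node, so a split is kept only when the improvement exceeds \gamma.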

4.5. Support Vector Machine

SVM is a supervised machine learning algorithm used for classification and regression tasks. By mapping the data into a higher-dimensional space, SVM identifies an optimal boundary that separates the classes, high claim and low claim in our case [29]. The objective is to identify a function f(x) that minimizes deviation from the actual values while remaining as flat as possible. SVM performance is determined by two essential parameters, the margin of tolerance and the cost parameter, which balance the flatness of the function against the tolerance for deviations. SVM models fall into two categories: linear and nonlinear. Linear SVM applies only to data that can be separated by a straight line (or, in higher dimensions, a hyperplane). Nonlinear SVM uses the kernel trick to map the data into a higher-dimensional space in which a separating hyperplane can be found. To obtain effective class separation in the transformed space, it is important to select an adequate kernel function, such as linear, polynomial, or radial basis function (RBF) [30].

4.6. K-Nearest Neighbors

The K-Nearest Neighbors method is a basic non-parametric supervised learning algorithm that performs both classification and regression tasks. In classification, K-NN predicts the class by a majority vote among the nearest neighbors; in regression, it takes the mean or median of the nearest neighbors [31]. K-NN works effectively on datasets with a small or moderate number of records when the class boundaries follow simple patterns. The algorithm is a memory-based, lazy learning method: it stores the training points and compares each new data point with the stored data. The choice of the parameter “k” is an important decision in implementing K-NN, as it determines the number of nearest neighbors considered in the prediction process [6,29].
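
A minimal K-NN classifier along these lines is sketched below. Euclidean distance is assumed (the text does not name the distance measure), and the two-feature training points are hypothetical normalized values.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of query by majority vote among its k nearest points."""
    # Sort stored training points by Euclidean distance to the query.
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical normalized features (e.g. driver age, vehicle age) and labels.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
train_y = [0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.82, 0.90)))  # 1
print(knn_predict(train_X, train_y, (0.15, 0.15)))  # 0
```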

4.7. Naïve Bayes

The Naïve Bayes classifier is a probabilistic method that applies Bayes’ Theorem under the assumption that the features are independent of one another given the class. The algorithm delivers effective results across different applications, even though this independence assumption commonly fails in real-world data. During classification, Naïve Bayes performs two steps: first, it evaluates the conditional probabilities of the features for each class; then, it selects the class with the maximum probability. Naïve Bayes is widely used in domains such as text classification due to its efficient computation [26,30].

All analyses were carried out in R (version 4.3.2), with machine learning handled through standard, widely used packages. Specifically, the “rpart” package was used for Decision Trees, “randomForest” for Random Forests, “xgboost” for Extreme Gradient Boosting, “e1071” for Support Vector Machines and Naïve Bayes, and “class” for K-Nearest Neighbors. These packages were selected because they are robust, widely used by researchers in classification modeling, and consistent with the methodological approach of this study [32].

4.8. Model Evaluation

The model evaluation process depends on metrics like accuracy and AUC, together with confusion matrices and type I and type II errors, which provide analysis of model reliability.

4.8.1. Confusion Matrix

The assessment of model performance relied on a confusion matrix by comparing predicted labels to actual labels in the test dataset. The confusion matrix is a 2 × 2 table that evaluates classification results by presenting true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs). High values of TP and TN demonstrate better performance of the model. The disparity between actual and predicted classifications is shown in the confusion matrix analysis of Table 3.

In the analysis, TN means true low cases predicted correctly, FN indicates high cases that have been misclassified as low, FP indicates low cases that have been misclassified as high, and TP indicates high cases correctly identified [6,11].

4.8.2. Accuracy

Accuracy is the ratio of correct predictions to total classifications across both claim categories, as shown in Formula (2).

Accuracy = \frac{TP + TN}{TP + TN + FN + FP} \qquad (2)
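
The confusion matrix counts and the accuracy ratio can be computed as in the sketch below. The actual and predicted labels are hypothetical, with 1 denoting a high claim and 0 a low claim.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels where 1 = high claim."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all classifications."""
    return (tp + tn) / (tp + tn + fn + fp)

# Hypothetical test labels and model predictions.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```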

4.8.3. Area Under the Curve (AUC)

The Receiver Operating Characteristic (ROC) curve is a graph that plots the True Positive Rate (the share of high claims correctly predicted as high) against the False Positive Rate (the share of low claims incorrectly identified as high) and is used to assess a classification model’s performance. The AUC, or area under the ROC curve, quantifies this performance. An AUC above 0.5 indicates that the model performs better than random guessing, whereas values at or below 0.5 reflect poor performance. An AUC of 1 indicates perfect classification [18,33].
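
One way to compute the AUC without drawing the curve uses its pairwise interpretation: the AUC equals the probability that a randomly chosen high claim receives a higher score than a randomly chosen low claim, with ties counted as one half. The sketch below applies this to hypothetical predicted probabilities.

```python
def auc(scores, labels):
    """AUC as the share of (high, low) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of "high claim" and the true labels.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]

print(round(auc(scores, labels), 4))  # 0.8889
```

In this toy example, 8 of the 9 high/low pairs are ordered correctly, which gives 8/9 ≈ 0.889.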

4.8.4. Type I and Type II Errors

Insurance claims classification relies on evaluating Type I and Type II errors to assess misclassification performance beyond basic accuracy metrics. Identifying low claims as high claims constitutes a Type I error, which can result in financial losses for the insurance company [34].

A Type II error occurs when high claims are misclassified as low claims, leading to lost revenue. From a financial perspective, Type I errors are generally more costly than Type II errors.

Type I error = FP/(TN + FP);
Type II error = FN/(TP + FN).
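
The two error rates above can be computed directly from the confusion matrix counts. The counts in this sketch are hypothetical and serve only to illustrate the arithmetic.

```python
def type_i_error(tn, fp):
    """Share of low claims misclassified as high: FP / (TN + FP)."""
    return fp / (tn + fp)

def type_ii_error(tp, fn):
    """Share of high claims misclassified as low: FN / (TP + FN)."""
    return fn / (tp + fn)

# Hypothetical confusion matrix counts for a test set of 150 claims.
tp, fn, tn, fp = 60, 15, 70, 5

print(round(type_i_error(tn, fp), 4))   # 0.0667
print(round(type_ii_error(tp, fn), 4))  # 0.2
```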

A 2 × 2 confusion matrix identifies and analyzes both TP and TN scores and errors to evaluate model performance systematically.

4.8.5. Hyperparameter Tuning

Hyperparameter tuning is important in maximizing the efficiency of machine learning models. Grid Search Cross Validation (GridSearchCV) is used for each model to evaluate and select the optimal combination of hyperparameters. This approach improves generalization by systematically testing a specified range of parameter values and applying the best configuration [14]. The most relevant hyperparameter for the RF algorithm is mtry, which stands for the number of randomly chosen predictors. For the XGBoost model, the most important parameters are the learning rate (eta), maximum tree depth (max_depth), column sampling ratio (colsample_bytree), number of boosting rounds (nrounds), sample proportion (subsample), and minimum loss reduction (gamma). In the SVM algorithm, the key parameters include Kernel (the function used to transform the data, such as linear, polynomial, or RBF), Cost (the balance between the size of the allowed margin and penalty for misclassifications), and Gamma (the influence of each training point on the decision boundary). For the decision tree and K-NN algorithms, the parameters cp (complexity parameter) and K (the specified number of neighbors) are considered, respectively. Finally, the Naïve Bayes algorithm makes use of Laplace (Laplace correction for zero probabilities), adjust (bandwidth adjustment), and usekernel (distribution type) parameters [20].
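
The grid search idea can be sketched for a single hyperparameter, the number of neighbors K in K-NN, using k-fold cross-validation. This is an illustrative stand-in rather than the authors' tuning code: the data, the grid of K values, the number of folds, and the fold assignment are all assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest (point, label) pairs in train."""
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, folds=5):
    """Mean accuracy of K-NN over interleaved cross-validation folds."""
    fold_scores = []
    for f in range(folds):
        test = [d for i, d in enumerate(data) if i % folds == f]
        train = [d for i, d in enumerate(data) if i % folds != f]
        correct = sum(1 for x, y in test if knn_predict(train, x, k) == y)
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / folds

def grid_search(data, grid, folds=5):
    """Evaluate every candidate in grid and return the best one with all scores."""
    scores = {k: cv_accuracy(data, k, folds) for k in grid}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical, well-separated toy data: 10 low claims and 10 high claims.
data = [((0.10 + 0.02 * i, 0.2), 0) for i in range(10)]
data += [((0.70 + 0.02 * i, 0.8), 1) for i in range(10)]

best_k, scores = grid_search(data, grid=[1, 3, 5, 7])
print(best_k, scores[best_k])
```

Ties between candidates are resolved in favor of the first value in the grid, which is a simplification; a real tuning run would typically also tune several parameters jointly.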

5. Results

In this section, we present the results obtained from the classification models in Section 4 and show their predictive accuracy based on the evaluation metrics outlined in Section 4.8. Table 4 presents the results for all classification algorithms evaluated in this study, with performance metrics such as CA (classification accuracy) and AUC. After the data preprocessing, we conducted a random split that used 80% of the data for training and the remaining 20% for testing. The training data consisted of 608 insured individuals, while 150 insured individuals were in the test set.

5.1. Model Performance

The Random Forest model achieved the highest accuracy (CA = 0.8867) and the highest AUC (0.9437), as shown in Table 4. XGBoost placed second with an accuracy of 0.84 and an AUC of 0.9179. Naïve Bayes and K-NN yielded similar AUC values of 0.8703 and 0.8621, respectively, but differed in accuracy (0.7105 for Naïve Bayes versus 0.8618 for K-NN). SVM and Decision Tree produced the same accuracy of 0.7467, with AUC values of 0.8416 and 0.8258, respectively. The Logistic Regression model had the lowest AUC (0.7978), with an accuracy of 0.7599. We concluded that Random Forest provided the best results in classifying insurance claims into low and high-value categories with respect to CA and AUC metrics. These results coincide with previous studies [11,16], which confirm that Random Forest outperforms all other models in identifying high-value claims.

We evaluated model performance using the area under the receiver operating characteristic curve (AUC) shown in Figure 1. The performance of a model improves when AUC scores increase toward 1. Our results demonstrate that the random forest model achieved the highest AUC score of 0.94, followed by XGBoost with an AUC of 0.92. Random Forest outperformed all other models at classifying positive and negative outcomes, while logistic regression produced the lowest performance according to the results. Random Forest demonstrates optimal performance from AUC results, which makes it a suitable choice for high insurance claim forecasting.
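For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen high claim receives a higher predicted score than a randomly chosen low claim. The Python sketch below computes it directly from that definition. It is an illustration only: `auc_score` is a hypothetical helper, and the six labels and scores are made up rather than taken from the study's data.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 0/1 values (1 = high claim); scores: predicted probabilities.
    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for six claims (not from the study's data).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(auc_score(labels, scores))
```

In this toy example eight of the nine high/low pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic in the sample size, which is acceptable for a test set of about 150 claims; rank-based implementations are preferable for large data.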

5.2. Confusion Matrices Results

Figure 2 presents the models’ effectiveness in distinguishing the two classes. The random forest model achieved the highest accuracy, correctly predicting 69 high and 64 low instances, resulting in high true positive and true negative rates. It misclassified 11 high claims as low and 6 low claims as high. The XGBoost model performed slightly worse, correctly identifying 63 high and 63 low claims while misclassifying 12 high claims as low and 12 low claims as high.

The K-NN and decision tree classifiers also perform well, as shown in Figure 3, correctly identifying 65 and 60 high claims and 66 and 52 low claims, respectively. However, they misclassify 12 and 15 low claims as high, and 9 and 23 high claims as low, respectively.

Furthermore, the SVM, Naïve Bayes, and Logistic Regression models exhibit varying degrees of classification accuracy in relation to claim classification, as illustrated in Figure 4. The SVM and Naïve Bayes models accurately classified 53 and 39 high claims and 59 and 69 low claims, respectively. In contrast, the logistic regression model correctly identified only 54 high claims and 53 low claims, indicating its comparatively lower classification accuracy for high claims when assessed against the other models. Nevertheless, all three models demonstrate notable classification errors, with the SVM misclassifying 22 low claims as high and 16 high claims as low. In the case of Naïve Bayes, 38 low claims were misclassified as high and 6 high claims as low; similarly, the logistic regression model misclassified 21 high claims as low and 22 low claims as high.

5.3. Results on Type I and Type II Errors

According to Table 5, Random Forest has the lowest Type I error (8.57%) together with a Type II error of 13.75%, followed by K-NN (Type I: 15.38%; Type II: 12.16%) and XGBoost (Type I: 16%; Type II: 16%). All three approaches keep false positives and false negatives comparatively low. SVM and Decision Tree show poorer test performance, while Naïve Bayes and logistic regression show the highest Type I error rates, at 35.51% and 28.38%, respectively. The Type I and Type II error rates may be slightly inflated by the small sample size, which reduces statistical power and increases variability in the classification results.

The study [35] showed that limited sample sizes generate high rates of Type I and Type II errors because statistical power is low while sample variability remains high, which increases false detection rates. Samples with fewer than 1000 observations, often referred to as ‘small sample sizes’, tend to overestimate model performance, especially when proper validation methods are ignored. Small samples are also associated with higher reported classification accuracy, which can reflect biased performance estimates [36].

5.4. Variable Importance Analysis

A Random Forest model uses feature rankings to determine how well each feature contributes to the target variable prediction.

Figure 5 presents the variable importance derived from the Random Forest model, which is identified as the most robust model. Our results indicate that claim type, deferred period, vehicle brand, and driver age show the most decisive influence on high claim occurrence rates. The impacts of vehicle age and frequency are evident, while region, gender, and vehicle type are relatively less influential on the target variable. These findings provide insurance companies the ability to enhance their risk evaluation, optimize premiums and reduce high insurance claims through the identification of key factors.

According to Figure 6, the XGBoost variable importance analysis shows that claim type, vehicle age, age of the driver, and vehicle brand are the most influential factors in high claim occurrences. The independent variables deferred period and frequency demonstrate significant importance in high claim occurrences, but region, vehicle type, and gender show lower importance. We can conclude that both models with the highest accuracy, RF and XGBoost, show that claim type, driver age, and vehicle brand are the most important variables. Conversely, gender, region, and vehicle type features are identified as the least significant factors.

In classification models, SHAP (SHapley Additive exPlanations) computes the average marginal contribution of each feature across all possible feature combinations to explain its impact on the predicted class probability [14]. We used Shapley values to understand the contribution of each feature to the models’ predictions. Based on the results shown in Figure 7, which depicts SHAP values corresponding to the high claims category, Vehicle Brand, Claim Type, Deferred Period, and Driver Age were determined to be the key factors in estimating bodily injury claims. The SHAP plots illustrate the relative importance and influence of features in classifying high claims using the Random Forest (left) and XGBoost (right) models. Both models highlight “VehicleBrand” as the most influential feature, where larger values (in red) increase the chance of a claim being placed in the high-risk group, and smaller values (in blue) have a negative impact, pushing the prediction toward low claims. Similarly, “ClaimType” had a significant effect on claim status: certain types of claims are linked to a greater chance of being classified as high claims (positive SHAP values). Other features, such as “DeferredPeriod”, “Frequency”, and “DriverAge”, show moderate influence, with SHAP values reflecting variable impact across the models. In contrast, features like “Gender”, “VehicleType”, and “Region” have relatively narrow SHAP distributions, suggesting limited predictive power in both models.
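The "average marginal contribution across all possible feature combinations" can be made concrete with a small Python sketch that computes exact Shapley values by enumerating every coalition of features. Several simplifying assumptions apply: `shapley_values` and `toy_model` are hypothetical names, the toy model is not the fitted Random Forest, and features outside a coalition are replaced by a single baseline value, whereas SHAP implementations typically average over background data and use faster model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance by enumerating all coalitions.

    Features outside a coalition take their baseline value. The cost grows
    as 2**n_features, so this is only practical for a handful of features.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[j] if j in coalition else baseline[j] for j in range(n)]
        return predict(x)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        phis.append(phi)
    return phis


def toy_model(x):
    """Hypothetical risk score with one interaction term."""
    return 0.5 * x[0] + 0.2 * x[1] + 0.1 * x[0] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phis = shapley_values(toy_model, instance, baseline)
print(phis)
# Efficiency property: contributions sum to f(instance) - f(baseline).
print(sum(phis), toy_model(instance) - toy_model(baseline))
```

In the toy model the interaction term is shared equally between the two features involved, which is the behaviour that makes Shapley values attractive for tree ensembles with many interactions.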

5.5. Robustness Evaluation of Model Performance

5.5.1. Tuning Parameters

The models were evaluated using 10-fold cross-validation, focusing on accuracy and AUC as the performance metrics. The results of tuning the hyperparameters for the machine learning models used in this study are summarized in Table 6. The Random Forest model performed best with an mtry value of 10. The best configuration for the XGBoost model included eta at 0.2, max_depth set to 9, subsample at 0.7, colsample_bytree at 0.7, nrounds at 100, and gamma at 0. The optimal parameters for the Decision Tree and K-NN models were cp = 0.004 and K = 1, respectively. SVM demonstrated optimal performance with a polynomial kernel, Cost = 1, and Gamma = 0.1. Naïve Bayes reached its best performance with Laplace = 0, Adjust = 0.5, and Usekernel = TRUE.
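The mechanics of 10-fold cross-validation can be sketched in plain Python as follows. This is an illustrative outline rather than the study's implementation: `k_fold_indices` and `cross_validate` are hypothetical helpers, the scoring callback is a placeholder, and applying the folds to the 608 training records is an assumption, since the text does not state which subset was cross-validated.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0)
                  for f in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average a user-supplied score over the k validation folds."""
    scores = [fit_and_score(train_idx, valid_idx)
              for train_idx, valid_idx in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Assumption: folds are drawn from the 608 training records of Section 5.
folds = list(k_fold_indices(608, k=10))
print([len(valid_idx) for _, valid_idx in folds])

# Placeholder callback: returns the validation share instead of a model score.
print(cross_validate(608, lambda train_idx, valid_idx: len(valid_idx) / 608))
```

With 608 records, eight folds hold 61 observations and two hold 60, so each candidate configuration is trained ten times on roughly 547 records and scored on the remaining fold.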

5.5.2. Validation with 70/30 Data Split

This subsection presents the results of the experiments conducted with a partitioning of 70% for training and 30% for testing, aimed at verifying the robustness of the findings from Section 5.1. As shown in Table 7, the Random Forest model again attains the highest AUC (0.9163), while XGBoost, the second-best model in the 80/20 split, attains a slightly higher classification accuracy (0.8584 versus 0.8451). The lowest-performing models were decision tree and logistic regression.

6. Discussion

This study aimed to evaluate the effectiveness of various machine learning models in classifying insurance claims into high and low categories, thereby supporting decision-making related to risk assessment and underwriting.

Considering the comparison results, RF and XGBoost show the highest classification accuracy and AUC values, which highlights their capability to predict complex outcomes in insurance datasets. The RF model demonstrated its superiority among the ensemble algorithms, achieving the best performance (CA = 0.8867, AUC = 0.9437), which aligns with findings from other studies [1,6,20,22] where RF was found to be the most powerful classifier for predictive modeling of insurance claims, fraud detection, and customer behavior. Supporting results were also found for XGBoost, which performed strongly (AUC = 0.9179), in line with [16,24], which showed that XGBoost can capture nonlinear patterns and interactions in structured insurance data.

On the other hand, for high-claim classifications, the DT and LR models did not provide sufficiently accurate predictions. The results coincide with prior findings [20], where these models faced difficulties with the imbalance and complexity of real-world insurance data. SVM, NB, and K-NN perform well, but all three lack the flexibility and depth of tree-based ensemble methods on mixed-type data with interaction effects.

Analyzing the confusion matrices, we confirmed the reliability of RF and XGBoost in minimizing false negatives, which is very important in high-risk claim detection. One of the most significant risks an insurer can face is underestimating the risk of high claims, which can lead to financial exposure. Analysis of feature importance showed that claim type, age of the driver, vehicle brand, and deferred period have a significant impact on high-claim classifications. These results correspond with the literature findings [16,22,23], which emphasized the predictive value of behavioral and demographic variables. On the other hand, gender and vehicle type were found to have the least impact, which coincides with the study [25] supporting the relevance of feature selection in enhancing model performance. Dividing the data with a 70/30 ratio for a robustness check yielded comparable rankings, showcasing the consistency of the results. This suggests that ensemble methods, particularly RF and XGBoost, give consistent results across different data partitions. For insurance analytics, these results are crucial. Insurers can utilize advanced ML algorithms to enhance risk classification, refine their premium models, and reduce losses from high-risk clients whose risk is underestimated.

The findings of this study show that Random Forest and XGBoost performed better than other classification models in predicting bodily injury claim risk in a motor liability insurance portfolio. Both models achieved high performance and highlighted that claim type, deferred period, and driver age are the most important factors for high claim occurrence. These results provide insurers with valuable tools for improving risk segmentation and more accurately classifying policyholders according to their expected claim behavior. In practical terms, these models can support individualized premium setting and more precise underwriting strategies. Insurers can also inform consumers about risks associated with different claims and adjust coverage accordingly. For instance, knowing that certain claim categories or deferred periods can result in higher risks helps insurers revise their policies or take preventive measures. Overall, this study underlines the need for incorporating machine learning models into insurance workflows, assisting the industry in following a data-driven and explainable AI practice while advocating for transparency.

7. Conclusions

This study aimed to analyze the effectiveness of classification models that could be used to predict insurance claims and determine the most important variables associated with high claim occurrences. Utilizing a dataset obtained from a private insurance company in Albania from the period 2018–2024, we studied 802 records, which included claim, vehicle, and insured variables. After preprocessing the dataset, we implemented logistic regression and machine learning methods such as Decision Tree, Random Forest, XGBoost, K-NN, SVM, and Naïve Bayes. Classification accuracy, area under the curve, and Type I and Type II error rates were some of the key indicators used to evaluate model performance.

The findings showed that the Random Forest model achieved the highest accuracy (CA = 0.8867) and AUC (0.9437) while maintaining the lowest error rates (Type I: 8.57%, Type II: 13.75%) in comparison to the other models. XGBoost was the second-best performing model, followed by K-NN, Naïve Bayes, SVM, Decision Tree, and Logistic Regression, which had the lowest performance. Moreover, we highlighted the important factors, such as claim type, age of driver, vehicle brand, and deferred period, that influence the high claim occurrence.

This study emphasizes the importance of machine learning models in improving the accuracy of claim classification and how they can serve policymakers and insurance firms. With the best predictive models, insurers will be able to assess risks better, improve underwriting, and even enhance fraud detection systems. This paper contributes to the body of knowledge by empirically validating ensemble methods as the most effective when applied to a real-world insurance dataset. It provides a benchmarking framework that can be used for future predictive modeling studies.

This study has some limitations. First, the dataset for this analysis is rather small, which could lead to sampling bias, thereby affecting claim classification. Moreover, the dataset spans only 2018 to 2024, which could limit the capture of trends and variations in insurance claims over time. Second, although significant predictive variables were found, adding other relevant factors and continuous model monitoring could improve the prediction model’s reliability and accuracy. Finally, no single classification model is universally best, as the effectiveness of a classification method depends on the dataset, the complexity of the problem, and the case-specific conditions of the insurance market. Therefore, decision-makers and industry practitioners should use the models that are the most accurate for the given situation.

Future research should emphasize expanding the dataset and study period to enhance model validity, include additional variables to improve predictive accuracy, and investigate advanced techniques such as neural networks and hybrid models for more precise forecasting.

Author Contributions

Conceptualization, E.B. and A.B.; methodology, E.B., A.B. and A.G.; software, E.B.; validation, E.B., A.B. and A.G.; formal analysis, A.G.; investigation, A.B.; resources, E.B.; data curation, E.B.; writing—original draft preparation, E.B.; writing—review and editing, A.B. and A.G.; visualization, E.B.; supervision, A.B. and A.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy and ethical reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Summary of Variables and Encoding.

Category	Variable	Definition	Type	Categories/Intervals
Claim factors	Individual Claims	The monetary amount claimed in insurance, based on the type of the claim.	Continuous	[5904, 15,900,000]
	Claim Type	The nature of the claim based on disability or fatality.	Categorical (ordinal)	General->1/Temporary disability->2/Partial permanent disability->3/Total permanent disability->4/Death->5
	Frequency	The number of claims made by the insured.	Numeric (discrete)	1, 2, 3, 4, 5
	Deferred Period	The waiting period (years) before claims are paid.	Numeric (discrete)	0, 1, 2, 3, 4, 5
Vehicle factors	Vehicle Type	The classification of the insured vehicle	Categorical (ordinal)	Motorcycles->1/Auto->2/Light trucks->3/Vans, Minibuses->4/Trucks, Buses->5
	Vehicle Brand	The Brand of the insured vehicle.	Categorical (nominal)	Mercedes Benz->19/Volkswagen->18/Ford->17/Audi->16/BMW->15/Opel->14/DaimlerChrysler->13/Fiat->12/Peugeot->11/Toyota->10/Suzuki->9/Volvo->8/Mitsubishi->7/Renault->6/Land Rover->5/Nissan->4/Iveco->3/Citroen->2/Seat->1/Others->0
	Vehicle Age	The number of years since the vehicle was manufactured.	Continuous	[1, 43]
Insured-related factors	Driver Age	The age of the insured driver.	Continuous	[16, 86]
	Region	The geographical area where the policyholder resides.	Categorical (nominal)	Center->1/South->2/North->3
	Gender	The gender of the insured driver.	Categorical (nominal)	Male->1/Female->0

Source: Authors’ elaborations.

Table A2. The distribution of high and low claims based on various factors.

Variable	Category	High	Low	Total
Claim Type	1	12 (1.5%)	5 (0.62%)	17 (2.12%)
	2	347 (43.38%)	223 (27.88%)	570 (71.25%)
	3	19 (2.38%)	130 (16.25%)	149 (18.62%)
	4	0 (0%)	4 (0.5%)	4 (0.5%)
	5	1 (0.12%)	59 (7.38%)	60 (7.5%)
Vehicle Type	1	11 (1.38%)	16 (2%)	27 (3.38%)
	2	306 (38.25%)	334 (41.75%)	640 (80%)
	3	27 (3.38%)	30 (3.75%)	57 (7.12%)
	4	17 (2.12%)	24 (3%)	41 (5.12%)
	5	18 (2.25%)	17 (2.12%)	35 (4.38%)
Region	1	212 (26.5%)	222 (27.75%)	434 (54.25%)
	2	89 (11.12%)	98 (12.25%)	187 (23.38%)
	3	78 (9.75%)	101 (12.62%)	179 (22.38%)
Gender	0	80 (10%)	84 (10.5%)	164 (20.5%)
	1	299 (37.38%)	337 (42.12%)	636 (79.5%)
Frequency	1	294 (36.75%)	295 (36.88%)	589 (73.62%)
	2	62 (7.75%)	60 (7.5%)	122 (15.25%)
	3	16 (2%)	36 (4.5%)	52 (6.5%)
	4	4 (0.5%)	22 (2.75%)	26 (3.25%)
	5	3 (0.38%)	8 (1%)	11 (1.38%)
Deferred Period	0	21 (2.62%)	9 (1.12%)	30 (3.75%)
	1	147 (18.38%)	90 (11.25%)	237 (29.62%)
	2	152 (19%)	202 (25.25%)	354 (44.25%)
	3	39 (4.88%)	72 (9%)	111 (13.88%)
	4	12 (1.5%)	37 (4.62%)	49 (6.12%)
	5	8 (1%)	11 (1.38%)	19 (2.38%)

Source: Authors’ calculations.

References

1. Rawat, S.; Rawat, A.; Kumar, D.; Sabitha, A.S. Application of Machine Learning and Data Visualization Techniques for Decision Support in the Insurance Sector. Int. J. Inf. Manag. Data Insights 2021, 1, 100012.
2. Poufinas, T.; Gogas, P.; Papadimitriou, T.; Zaganidis, E. Machine Learning in Forecasting Motor Insurance Claims. Risks 2023, 11, 164.
3. Prodanov, S. Indemnification of non-material damages caused by road traffic accidents—Ethical and financial aspects. Econ. Arch. 2017, 4, 3–14. Available online: https://ideas.repec.org/a/dat/earchi/y2017i4p3-14.html (accessed on 2 April 2025).
4. Wiedemann, M.; John, D. A practitioners approach to individual claims models for bodily injury claims in German non-life insurance. Z. Gesamte Versicherungswiss. 2021, 110, 225–254.
5. Weerasinghe, K.P.M.L.P.; Wijegunasekara, M.C. A comparative study of data mining algorithms in the prediction of auto insurance claims. Eur. Int. J. Sci. Technol. 2016, 5, 47–54. Available online: https://eijst.org.uk/files/images/frontimages/gallery/vol._5_no._1/6._47-54.pdf (accessed on 10 March 2025).
6. Hanafy, M.; Ming, R. Classification of the Insureds Using Integrated Machine Learning Algorithms: A Comparative Study. Appl. Artif. Intell. AAI 2022, 36, 1–32.
7. Alamir, E.; Urgessa, T.; Hunegnaw, A.; Gopikrishna, T. Motor Insurance Claim Status Prediction Using Machine Learning Techniques. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 3.
8. Brati, E.; Braimllari, A. Review of Statistical and Machine Learning Methods Applied in Private Health Insurance. Albanian J. Econ. Bus. 2024, 41, 66–80. Available online: https://feut.edu.al/konferenca-botime/revista-shkencore/2096-albanian-journal-of-economy-and-business (accessed on 16 March 2025).
9. Ellili, N.; Nobanee, H.; Alsaiari, L.; Shanti, H.; Hillebrand, B.; Hassanain, N.; Elfout, L. The Applications of Big Data in the Insurance Industry: A Bibliometric and Systematic Review of Relevant Literature. J. Finance Data Sci. 2023, 9, 100102.
10. Clemente, C.; Guerreiro, G.R.; Bravo, J.M. Modelling Motor Insurance Claim Frequency and Severity Using Gradient Boosting. Risks 2023, 11, 163.
11. Permai, S.D.; Herdianto, K. Prediction of Health Insurance Claims Using Logistic Regression and XGBoost Methods. Procedia Comput. Sci. 2023, 227, 1012–1019.
12. Brati, E.; Braimllari, A. Application of Bootstrap and Deterministic Methods for Reserving Claims in Private Health Insurance. Int. J. Math. Trends Technol. 2023, 69, 17–26.
13. Brati, E.; Braimllari, A. A Comparative Analysis of Stochastic Approaches for Claims Reserving in Private Health Insurance. WSEAS Trans. Bus. Econ. 2025, 22, 130–143.
14. Orji, U.; Ukwandu, E. Machine Learning for an Explainable Cost Prediction of Medical Insurance. Mach. Learn. Appl. 2024, 15, 100516.
15. Vinora, A.; Surya, V.; Lloyds, E.; Kathir Pandian, B.; Deborah, R.N.; Gobinath, A. An Efficient Health Insurance Prediction System Using Machine Learning. In Proceedings of the 2023 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 14–15 December 2023; pp. 1–5.
16. Maisog, J.M.; Li, W.; Xu, Y.; Hurley, B.; Shah, H.; Lemberg, R.; Gutfraind, A. Using Massive Health Insurance Claims Data to Predict Very High-Cost Claimants: A Machine Learning Approach. arXiv 2019, arXiv:1912.13032.
17. Langenberger, B.; Schulte, T.; Groene, O. The application of machine learning to predict high-cost patients: A performance comparison of different models using healthcare claims data. PLoS ONE 2023, 18, e0279540.
18. Grize, Y.-L.; Fischer, W.; Lützelschwab, C. Machine Learning Applications in Nonlife Insurance. Appl. Stoch. Models Bus. Ind. 2020, 36, 523–537.
19. Alomair, G. Predictive Performance of Count Regression Models Versus Machine Learning Techniques: A Comparative Analysis Using an Automobile Insurance Claims Frequency Dataset. PLoS ONE 2024, 19, e0314975.
20. Hanafy, M.; Ming, R. Machine Learning Approaches for Auto Insurance Big Data. Risks 2021, 9, 42.
21. Nabrawi, E.; Alanazi, A. Fraud Detection in Healthcare Insurance Claims Using Machine Learning. Risks 2023, 11, 160.
22. Mavundla, K.; Thakur, S.; Adetiba, E.; Abayomi, A. Predicting Cross-Selling Health Insurance Products Using Machine-Learning Techniques. J. Comput. Inf. Syst. 2024, 1–18.
23. Yego, N.K.K.; Nkurunziza, J.; Kasozi, J. Predicting Health Insurance Uptake in Kenya Using Random Forest: An Analysis of Socio-Economic and Demographic Factors. PLoS ONE 2023, 18, e0294166.
24. Wang, Y. Predictive Machine Learning for Underwriting Life and Health Insurance. In Proceedings of the Actuarial Society of South Africa’s 2021 Virtual Convention, Virtual, 19–22 October 2021; Available online: https://www.actuarialsociety.org.za/convention/wp-content/uploads/2021/10/2021-ASSA-Wang-FIN-reduced.pdf (accessed on 15 January 2025).
25. Taha, A.; Cosgrave, B.; Mckeever, S. Using Feature Selection with Machine Learning for Generation of Insurance Insights. Appl. Sci. 2022, 12, 3209.
26. Adnan Aslam, M.; Murtaza, F.; Ehatisham Ul Haq, M.; Yasin, A.; Ali, N. SAPEx-D: A Comprehensive Dataset for Predictive Analytics in Personalized Education Using Machine Learning. Data 2025, 10, 27.
27. Breskuvienė, D.; Dzemyda, G. Categorical Feature Encoding Techniques for Improved Classifier Performance When Dealing with Imbalanced Data of Fraudulent Transactions. Int. J. Comput. Commun. Control 2023, 18, 3.
28. Lunardon, N.; Menardi, G.; Torelli, N. ROSE: A Package for Binary Imbalanced Learning. R J. 2014, 6, 79.
29. Lee, C.-W.; Fu, M.-W.; Wang, C.-C.; Azis, M.I. Evaluating Machine Learning Algorithms for Financial Fraud Detection: Insights from Indonesia. Mathematics 2025, 13, 600.
30. Dhamo, Z.; Gjeçi, A.; Zibri, A.; Prendi, X. Business Distress Prediction in Albania: An Analysis of Classification Methods. J. Risk Financ. Manag. 2025, 18, 118.
31. AbdElminaam, D.S.; Farouk, M.; Shaker, N.; Elrashidy, O.; Elazab, R. An Efficient Framework for Predicting Medical Insurance Costs Using Machine Learning. J. Comput. Commun. 2024, 3, 55–64.
32. Therneau, T.; Atkinson, B.; Ripley, B.; Venables, W.N.; Liaw, A.; Wiener, M.; Chen, T.; He, T.; Benesty, M.; Tang, Y.; et al. R Packages Used for Classification Modeling: Rpart, randomForest, xgboost, e1071, and Class; R Foundation for Statistical Computing: Vienna, Austria, 2023; Available online: https://cran.r-project.org (accessed on 10 November 2024).
33. Liu, C.-J.; Huang, T.-S.; Ho, P.-T.; Huang, J.-C.; Hsieh, C.-T. Correction: Machine Learning-Based E-Commerce Platform Repurchase Customer Prediction Model. PLoS ONE 2024, 19, e0315518.
34. Ala’raj, M.; Abbod, M.; Radi, M. The Applicability of Credit Scoring Models in Emerging Economies: An Evidence from Jordan. Int. J. Islam. Middle East Financ. Manag. 2018, 11, 608–630.
35. Rajput, D.; Wang, W.J.; Chen, C.C. Evaluation of a Decided Sample Size in Machine Learning Applications. BMC Bioinform. 2023, 24, 48.
36. Vabalas, A.; Gowen, E.; Poliakoff, E.; Casson, A.J. Machine Learning Algorithm Validation with a Limited Sample Size. PLoS ONE 2019, 14, e0224365.

Figure 1. The visualization of the AUCs for the models.

Figure 2. (a) Random Forest confusion matrix; (b) XGBoost confusion matrix.

Figure 3. (a) K-NN confusion matrix; (b) Decision Tree confusion matrix.

Figure 4. (a) SVM confusion matrix; (b) Naïve Bayes confusion matrix; (c) Logistic Regression confusion matrix.

Figure 5. Variable Importance rankings derived from the Random Forest model.

Figure 6. The variable importance rankings derived from the XGBoost model.

Figure 7. SHAP summary plot showing feature contributions to the classification model: (a) Random Forest; (b) XGBoost.

Table 1. Summary of reviewed studies on machine learning applications in insurance.

Study	Purpose	Algorithms	Dataset	Performance Metrics	Key Findings
[16]	High-cost claim prediction	LightGBM, XGBoost	U.S. health insurance data with 48 million observations (2017–2019)	Accuracy, Precision, F1-score, Recall	LightGBM best performer; key predictors: age, rising cost, life expectancy
[6]	Insurance claim classification	Random Forest, Decision Tree, K-NN, Logistic Regression	Three auto insurance datasets from Kaggle repository	AUC	RF achieved highest classification performance
[19]	Auto claim frequency prediction	SVM, Poisson, Negative Binomial, Zero-Inflated Poisson	Automobile insurance data from SAS Enterprise Miner (10,303 observations and 33 variables)	Mean Absolute Error	SVM outperformed others; improved risk pricing and estimation
[20]	Claim occurrence prediction in auto insurance	Logistic Regression, XGBoost, RF, DT, NB, K-NN	Brazilian automotive data from Kaggle repository (1.48 M observations, 59 features)	Accuracy, AUC	RF achieved best results among tested models
[10]	Modeling motor insurance claim frequency and severity	Gradient Boosting (GB), Generalized Linear Models (GLMs)	European auto insurer (2,464,181 observations and 21 features, 2016–2019)	Friedman’s H-statistic	GB better on frequency, GLM better on severity; ML enables better risk management
[1]	Fraud detection	RF, DT, SVM, K-NN, Logistic Regression, Gaussian Naïve Bayes, Bernoulli Naïve Bayes, Mixed Naïve Bayes	Kaggle dataset (1338 records, 9 variables)	Accuracy, Precision, Recall, F1-score, AUC	RF most efficient for fraud detection
[21]	Health fraud detection	Random Forest, Logistic Regression, ANN	Saudi Arabia health insurance data (396 observations from January 2022 to May 2022)	Accuracy, Precision, Recall, F1-score	RF outperformed other classifiers; policy type, education, and age were identified as the most significant features
[11]	Insurance fraud prediction	XGBoost, Logistic Regression	Indonesian private insurer data (11,882 observations and 19 features)	Accuracy, Precision, Recall	XGBoost outperformed Logistic Regression
[18]	Application of ML in non-life insurance, focusing on premium pricing	XGBoost, Decision Tree, Neural Network	Data obtained from 20 competitors over a 12-month period (about 30,000 observations and 70 features)	Mean Absolute Percentage Error	XGBoost had best performance in premium prediction
[22]	Cross-selling behavior for health insurance	RF, K-NN, XGBoost, Logistic Regression	South African health insurance dataset (1 M records with 16 features)	Accuracy, Precision, Recall, F1-score, Support	RF identified strong predictors; achieved high predictive accuracy
[23]	Predict health insurance uptake behavior	Random Forest, XGBoost, Logistic Regression	Kenya FinAccess Survey data, 2021 (22,024 records with 23 features)	Accuracy, Recall, F1-score, AUC	RF outperformed other models in prediction accuracy
[24]	Underwriting optimization	XGBoost, Random Forest, Bagging, K-NN, Gradient Boosting, SVM, Decision Tree, AdaBoost, Logistic Regression	Reinsurer’s life/health data, January 2017 to June 2020 (29,317 observations and 37 variables)	Accuracy, Precision, Recall, F1-score	XGBoost outperformed other methods for underwriting
[25]	Feature selection and noise reduction for improved model performance	SVM and K-NN for classification tasks	Five public insurance datasets obtained from the Kaggle machine learning repository	Accuracy	Feature selection both enhances model accuracy and offers interpretability and strategic value to insurers

Source: Authors’ elaboration.

Table 2. Descriptive statistics of numerical features of the Insurance Dataset.

Variable	Mean	Standard Deviation	Min	Max
Claims	1,319,104.8	1,964,988.5	5904.48	15,900,000
Frequency	1.44	0.86	1	5
Driver Age	43.32	14.25	16	86
Vehicle Age	18.66	6.44	1	43
Deferred Period	1.96	1.02	0	5

Source: Authors’ calculations.

Table 3. Confusion matrix for claim classification.

	Predicted (Low)	Predicted (High)
Actual (Low)	True Negative (TN)	False Positive (FP) (Type I error)
Actual (High)	False Negative (FN) (Type II error)	True Positive (TP)
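
The cells of Table 3 can be turned into the evaluation measures used in this study. The following is a minimal sketch, not the authors' code: it assumes the conventional definitions (Type I error as the share of actual low claims predicted as high, Type II error as the share of actual high claims predicted as low), which may differ from the denominators used in the paper. The class name `ConfusionMatrix` and the counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts laid out as in Table 3 (hypothetical helper class)."""
    tn: int  # actual low, predicted low
    fp: int  # actual low, predicted high (Type I error)
    fn: int  # actual high, predicted low (Type II error)
    tp: int  # actual high, predicted high

    def accuracy(self) -> float:
        # Share of all claims that were classified correctly.
        total = self.tn + self.fp + self.fn + self.tp
        return (self.tp + self.tn) / total

    def type_i_error(self) -> float:
        # Assumed definition: FP / (FP + TN), the false positive rate.
        return self.fp / (self.fp + self.tn)

    def type_ii_error(self) -> float:
        # Assumed definition: FN / (FN + TP), the false negative rate.
        return self.fn / (self.fn + self.tp)


# Illustrative counts only; they are not taken from the study.
cm = ConfusionMatrix(tn=70, fp=10, fn=8, tp=62)
print(f"CA = {cm.accuracy():.4f}")
print(f"Type I error = {cm.type_i_error():.2%}")
print(f"Type II error = {cm.type_ii_error():.2%}")
```

For an insurer, a Type II error (a high-cost claim classified as low) is typically the more expensive mistake, which is why the two rates are reported separately rather than folded into accuracy alone.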

Table 4. Performance metrics (CA and AUC) of classification models, 80/20 data split.

ML Model	CA	AUC
Random Forest	0.8867	0.9437
XGBoost	0.84	0.9179
SVM	0.7467	0.8416
Naïve Bayes	0.7105	0.8703
Logistic Regression	0.7599	0.7978
Decision Tree	0.7467	0.8258
K-Nearest Neighbors	0.8618	0.8621

Source: Authors’ calculations.

Table 5. Type I and Type II Errors of ML classification models for testing data.

ML Model	Type I Error	Type II Error
Random Forest	8.57%	13.75%
XGBoost	16%	16%
SVM	27.16%	23.19%
Naïve Bayes	35.51%	13.33%
Logistic Regression	28.38%	28.95%
Decision Tree	22.39%	27.71%
K-Nearest Neighbors	15.38%	12.16%

Source: Authors’ calculations.

Table 6. Tuning parameters for machine learning models.

Model	Parameters	Range	Optimal Value
Random Forest	mtry	[3, 5, 9]	10
XGBoost	eta	[0.01, 0.1, 0.2]	0.2
	max_depth	[3, 6, 9]	9
	colsample_bytree	[0.5, 0.7, 1]	0.7
	subsample	[0.5, 0.7, 1]	0.7
	nrounds	[100, 200]	100
	gamma	[0 to 1]	0
Decision Tree	Cp	0 to 0.1	0.004
SVM	Kernel	[Linear, Radial, Polynomial]	Polynomial
	Cost	[0.1, 1, 10, 100]	1
	Gamma	[0.001, 0.01, 0.1, 1]	0.1
K-NN	K	[1 to 10]	1
Naïve Bayes	Laplace	[0 to 1]	0
	Adjust	[0 to 1]	0.5
	Usekernel	[FALSE, TRUE]	TRUE

Source: Authors’ calculations.
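
To illustrate the size of the search space implied by Table 6, the sketch below expands the discrete XGBoost ranges into a full grid. It is an illustration under stated assumptions, not the authors' tuning procedure: the helper `expand_grid` is hypothetical, and gamma is left out because the table gives it only as an interval (0 to 1) without listing the values tried.

```python
from itertools import product

# Discrete XGBoost ranges as listed in Table 6 (gamma omitted, see lead-in).
xgb_grid = {
    "eta": [0.01, 0.1, 0.2],
    "max_depth": [3, 6, 9],
    "colsample_bytree": [0.5, 0.7, 1],
    "subsample": [0.5, 0.7, 1],
    "nrounds": [100, 200],
}


def expand_grid(grid):
    """Return every combination of the listed values as a dict (hypothetical helper)."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


candidates = expand_grid(xgb_grid)

# Optimal values reported in Table 6 for the same five parameters.
reported_optimum = {
    "eta": 0.2,
    "max_depth": 9,
    "colsample_bytree": 0.7,
    "subsample": 0.7,
    "nrounds": 100,
}

print(len(candidates))                 # number of candidate configurations
print(reported_optimum in candidates)  # the reported optimum lies on the grid
```

Each candidate would be scored on held-out or cross-validated data; the exhaustive grid grows multiplicatively with every added parameter, which is why the ranges in Table 6 are kept to two or three values each.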

Table 7. Performance metrics (CA and AUC) of classification models 70/30 data split.

ML Model	CA	AUC
Random Forest	0.8451	0.9163
XGBoost	0.8584	0.8935
SVM	0.7434	0.8189
Naïve Bayes	0.7281	0.8651
Logistic Regression	0.7599	0.7982
Decision Tree	0.7389	0.7882
K-Nearest Neighbors	0.8158	0.8158

Source: Authors’ calculations.
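
The robustness check can be summarized by comparing the AUC values of Table 4 (80/20 split) with those of Table 7 (70/30 split). The sketch below only re-uses the figures reported in the two tables; the function names `auc_change` and `best_by_auc` are hypothetical and not part of the study.

```python
# (CA, AUC) per model, copied from Table 4 (80/20 split).
split_80_20 = {
    "Random Forest": (0.8867, 0.9437),
    "XGBoost": (0.84, 0.9179),
    "SVM": (0.7467, 0.8416),
    "Naive Bayes": (0.7105, 0.8703),
    "Logistic Regression": (0.7599, 0.7978),
    "Decision Tree": (0.7467, 0.8258),
    "K-Nearest Neighbors": (0.8618, 0.8621),
}

# (CA, AUC) per model, copied from Table 7 (70/30 split).
split_70_30 = {
    "Random Forest": (0.8451, 0.9163),
    "XGBoost": (0.8584, 0.8935),
    "SVM": (0.7434, 0.8189),
    "Naive Bayes": (0.7281, 0.8651),
    "Logistic Regression": (0.7599, 0.7982),
    "Decision Tree": (0.7389, 0.7882),
    "K-Nearest Neighbors": (0.8158, 0.8158),
}


def auc_change(before, after):
    """AUC under the second split minus AUC under the first, per model."""
    return {m: round(after[m][1] - before[m][1], 4) for m in before}


def best_by_auc(results):
    """Name of the model with the highest AUC."""
    return max(results, key=lambda m: results[m][1])


changes = auc_change(split_80_20, split_70_30)
for model, delta in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{model:<22} {delta:+.4f}")
print(best_by_auc(split_80_20), best_by_auc(split_70_30))
```

On the reported figures, Random Forest has the highest AUC under both splits, while K-Nearest Neighbors shows the largest AUC decline when the training share is reduced.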

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Brati, E.; Braimllari, A.; Gjeçi, A. Machine Learning Applications for Predicting High-Cost Claims Using Insurance Data. Data 2025, 10, 90. https://doi.org/10.3390/data10060090
