1. Introduction
Life insurance protects beneficiaries if an accidental death or unexpected event happens [1]. However, in countries with well-established social security systems, the demand for life insurance is frequently low [2]. The perception of mortality risk and attitudes toward life insurance play a vital role in the purchase of a life insurance policy. Most households are aware of the monetary risk posed by mortality. This awareness does not, however, lead to the purchase of life insurance, which may affect the sustainability of their finances. A survey by the Swiss Re Institute (2020) [3] found that households in all of the surveyed countries favor other strategies for boosting their financial security over life insurance, such as increasing their income or purchasing medical/critical illness insurance.
Following the outbreak of the COVID-19 pandemic, life insurance policies drew particular attention and interest among the numerous insurance products. The pandemic affected the global economy in both positive and negative ways. A series of public surveys conducted between March and June 2020 in the UK, the US, and Spain revealed that 30% of respondents said COVID-19 had increased their likelihood of considering buying life insurance [4]. However, based on the insurance barometer study conducted by LIMRA and Life Happens in 2021, life insurance ownership in the US decreased marginally, with only 52% of Americans claiming to have life insurance, down from 54% in 2020 [5].
According to Bank Negara Malaysia (the Central Bank of Malaysia), demand for life insurance has increased over the last decade, with per capita life insurance premiums rising from RM 797.00 (USD 176.97) in 2010 to RM 1250.00 (USD 277.56) in 2021, an increase of about 57%. Total premiums from new life insurance income rose from RM 7.9 billion to RM 40.75 billion, while new life insurance policies increased from 1.5 million to 17.5 million units. Meanwhile, in 1990, per capita insurance premium expenditure was only RM 92.00 (USD 20.43), while total new premiums (RM 573 million) and the number of new life insurance contracts (498,338) were significantly lower compared with the 2010 to 2021 period [6].
Figure 1a shows the number of policies and certificates in force from 2010 to 2021, which followed an increasing trend, although the number of life insurance policies in force dipped slightly in 2019. Meanwhile, Figure 1b presents the distribution of new sums insured for life insurance and the sums participated for Family Takaful. In 2020, the distribution of life insurance's new sums decreased, possibly due to the COVID-19 crisis that affected the nation. Notwithstanding the increase in active policies and premiums mentioned earlier, the Malaysian life insurance market still lags behind other established global and regional markets. This is demonstrated by the persistently low per capita insurance density in USD between 2010 and 2021 compared with other Asian nations. Although Malaysia's insurance density climbed by about 57% from USD 282.8 in 2010 to USD 444 in 2021 (above the world average of USD 382), it still lags behind developed Asian markets such as Taiwan (USD 3772), Hong Kong (USD 8433), and Singapore (USD 5414). Additionally, in 2021, the Malaysian life insurance penetration rate (the proportion of life insurance premiums to GDP) was estimated at 3.9%, significantly lower than the rates of developed Asian markets such as Hong Kong (17.3%), Taiwan (11.6%), Singapore (7.5%), and Japan (6.1%) [7].
Based on the above facts, life insurance ownership is an interesting topic, especially for overcoming the low penetration rate problem. Underinsurance is correlated with families' sustainability and financial health, and one way to address underinsurance is to increase the life insurance penetration rate. Hence, various measures must be considered to increase the penetration rate or to attract potential policyholders. One such initiative is to classify which customers are potential life insurance purchasers. Data mining techniques are commonly used to discover intriguing patterns in datasets and deliver helpful information for the future. Different data mining methods work well for classifying customers as potential or non-potential customers.
This study focuses on life insurance ownership in Malaysia, with the status of life insurance purchase as the main observation and Malaysian sociodemographic status as the determinant. Hence, this article uses a data mining approach to classify customers based on their attributes and to predict the class label for future customers, that is, whether a potential customer will purchase a life insurance policy or not. The findings will redound to society's benefit, considering that insurance protection is vital to families' sustainability and financial health. They may assist insurance companies in improving their underwriting process for selecting potential purchasers. Besides that, the descriptive analysis of the respondents' sociodemographic information may provide a better overview of the Malaysian target market, which may help increase the country's life insurance penetration rate. Apart from that, imbalanced datasets are a constant concern in machine learning and data mining because they make it difficult for algorithms to learn minority classes efficiently. Hence, since the dataset is imbalanced on the class label, this study also provides insight into prediction with different sampling and ensemble methods throughout the classification process.
3. Materials and Methods
Data mining methods have been used extensively for classifying customers and patients and for building predictive models, especially on extensive datasets. Bhatia et al. studied consumer life insurance purchasing behavior and stated that researchers could apply advanced supervised and unsupervised machine learning and Artificial Neural Network methods [8]. Meanwhile, previous studies recommend employing alternative classification methods such as Random Forest, Naive Bayes, or even Artificial Neural Networks for consumer segmentation and profiling [21]. This paper uses five classification models: Decision Tree, Logistic Regression, Naïve Bayes, Random Forest, and Artificial Neural Network.
A Decision Tree is a classification algorithm with a tree-based structure that classifies data by splitting them. The primary goal of data splitting is to discover common dataset behavior, which also aids in measuring prediction accuracy. This approach builds and trains a classification tree with leaf nodes and decision nodes based on logical principles [31]. Logistic Regression is a machine learning classification technique for binary dependent variables. In this approach, a logistic function is used to characterize the probabilities of the possible outcomes of a single trial [34]. Meanwhile, the Naïve Bayes algorithm determines the probability of each class from several independent input variables using the Bayesian theorem [31].
On the other hand, Random Forest is a supervised learning approach that can handle both classification and regression problems. It makes decisions by creating many trees, or a forest, that act as a collective. As an ensemble approach, Random Forest combines several decision trees built from a single base learner model [31]. Next, an Artificial Neural Network (ANN) is a mathematical or computational model inspired by biological neural networks. It processes information using a connectionist computation method and comprises a network of artificial neurons. An ANN is often an adaptive system that adjusts its structure in response to the external or internal information that flows through the network during the learning phase [35].
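Although all modeling in this study was carried out in RapidMiner, the five classifiers can be illustrated with a minimal Python/scikit-learn sketch; the library and parameter choices below are assumptions of the illustration, not the configuration used in the study.

```python
# Illustrative only: the study used RapidMiner operators, not scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# The five classifiers compared in this paper (DT, LR, NB, RF, ANN).
classifiers = {
    "DT": DecisionTreeClassifier(random_state=42),
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "ANN": MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42),
}
```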
Table 1 presents several advantages and disadvantages of the selected machine learning approaches. Every model shown has its pros and cons; however, these models were chosen based on their strengths and on the recommendations of previous studies.
Data mining involves six major steps, from data collection to model deployment.
Step 1: Data Acquisition—This study used the 2019 Malaysian Household Survey from the Department of Statistics, Malaysia (DOSM). The objective is to predict whether an individual purchases life insurance, where the target attribute, Life Insurance Ownership, can be No or Yes. A No means the head of household did not purchase a life insurance policy, and a Yes means that they did. This prediction fits within the scope of classification problems. It will help to establish whether having a life insurance policy is related to income category and life expectancy. If this connection is verified at the end of this study, measures may be taken by the competent entities so that the correct target customers for a life insurance policy can be identified.
Step 2: Data Understanding—This step begins with identifying whether each attribute is quantitative or qualitative and determining its measurement level (nominal, ordinal, interval, or ratio). The data comprise 12 inputs, with life insurance ownership status as the target of the study, as presented in Table 2. The original income attribute contains exact values; hence, it has been categorized into three income categories (bottom 40%, B40; middle 40%, M40; and top 20%, T20), because in Malaysia the government frequently provides incentives or fund assistance based on income category rather than exact values. Besides that, age has been grouped into two classes, non-prime working and prime working, where prime working age is assigned to those between 25 and 54.
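As an illustration of this grouping, the following sketch (in Python with pandas, which the study did not use; the column names and the B40/M40/T20 cut-offs are hypothetical) shows how the two derived attributes could be constructed:

```python
import pandas as pd

# Hypothetical example rows; "income" is monthly household income in RM.
df = pd.DataFrame({"income": [2500, 6000, 15000], "age": [22, 40, 60]})

# Hypothetical RM cut-offs separating the bottom 40%, middle 40%, and top 20%.
b40_max, m40_max = 4850, 10970
df["income_group"] = pd.cut(df["income"],
                            bins=[0, b40_max, m40_max, float("inf")],
                            labels=["B40", "M40", "T20"])

# Prime working age covers 25-54; all other ages are non-prime working.
df["age_group"] = df["age"].apply(
    lambda a: "prime working" if 25 <= a <= 54 else "non-prime working")
```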
Step 3: Data Preparation—The analysis was performed using RapidMiner Studio Educational 9.10.011 (RM). The data preparation step covers all data preparation activities, such as cleaning, transformation, and modification before modeling. These tasks include choosing which data should be included or excluded, considering the possibility of adding new attributes or changing existing ones, and data cleaning [38]. The dataset comprised 16,354 household records. However, only 14,270 records were considered for further analysis: since the labor force age in Malaysia is between 15 and 64 years old, any household head outside this range was removed. Since the dataset is too large to run outlier detection in RapidMiner, outliers were detected in SPSS by calculating the Mahalanobis distance. Any observation with a p-value less than 0.001 was considered an outlier and removed to produce stable parameter estimates. After removing the outliers, the total dataset used is 14,270 records. An overview of all the acronyms and their definitions is provided in Table 3.
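The outlier screening was run in SPSS; a minimal equivalent of the Mahalanobis-distance rule, sketched here in Python (an assumption of the illustration, with X taken to be a numeric attribute matrix), would be:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_keep_mask(X, alpha=0.001):
    """Boolean mask that is True for rows to keep (p-value >= alpha)."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances
    # Squared Mahalanobis distances follow a chi-square distribution
    # with degrees of freedom equal to the number of attributes.
    p_values = chi2.sf(d2, df=X.shape[1])
    return p_values >= alpha

# X_clean = X[mahalanobis_keep_mask(X)]  # drop rows with p < 0.001
```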
As shown in Figure 2, the label attribute was not balanced. Therefore, it was necessary to use pre-processing to avoid any misrepresentation of the minority class. In this paper, we split the dataset using 5-fold cross-validation. Pre-processing sampling techniques, such as under- and over-sampling, were applied only to the training set. The SMOTE operator was used as the oversampling technique. When creating synthetic samples for the dataset, SMOTE employs the k-NN approach by choosing the k nearest neighbors of each minority sample and interpolating between them. SMOTE may help make the majority–minority class border distinguishable, since it relies solely on minority class observations. After the SMOTE process, the class label became balanced, with Yes = 12,947 and No = 12,947. For the random under-sampling (RUS) technique, the Sample operator was used with three different ratios: 1:1 (No = 1323, Yes = 1323), 2:1 (No = 2646, Yes = 1323), and 3:1 (No = 3969, Yes = 1323).
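For readers who prefer code to the RapidMiner operators, the same four resampling configurations can be sketched with imbalanced-learn (the library, random seeds, and the variables X_train/y_train are assumptions of this illustration):

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# X_train, y_train are assumed to be the training fold of the survey data.
# SMOTE: synthesize minority-class samples from their k nearest neighbours
# until both classes are the same size.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_train, y_train)

# RUS at the three No:Yes ratios used in the paper; sampling_strategy is
# the desired minority/majority ratio after under-sampling the majority.
for name, ratio in [("1:1", 1.0), ("2:1", 0.5), ("3:1", 1 / 3)]:
    rus = RandomUnderSampler(sampling_strategy=ratio, random_state=42)
    X_rus, y_rus = rus.fit_resample(X_train, y_train)
```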
Step 4: Modelling—In this stage, five predictive models are used to predict whether a customer will purchase a life insurance policy. The classifiers are DT, LR, NB, RF, and ANN, chosen based on a literature review of past data mining studies and on being the five best models in ROC evaluation. In this paper, we compare the full model, without any sampling process, against models that have undergone resampling and ensemble processes, as follows: (i) the five classifiers with different sampling techniques, (ii) the classifiers with the bagging ensemble learning method, and (iii) the classifiers with the boosting ensemble method.
There are two stages in predictive modeling for life insurance ownership.
Figure 3 illustrates the user interface of life insurance ownership modeling without the ensemble learning method. In the initial stage, multiple copies of the data were produced because the dataset needed to be connected to multiple classifiers. A 5-fold cross-validation strategy was used to apply the model. The data file is loaded in the first interface, and the attributes are selected based on the attributes explained in Table 2 and Table 3. Next, the cross-validation operator with 5-fold validation is used. The "Cross Validation" operator, a so-called nested operator in RM, comprises the training and testing subprocesses. The dataset for this cross-validation process is divided into K (the number of folds) subsets. Each iteration uses one subset for testing, and the remaining subsets are used for training. As the testing dataset is unseen during training, applying the model's training and validation in one procedure is considered a fair test.
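The key point, that resampling is applied only to the training folds, can be sketched outside RapidMiner as follows (a Python/imbalanced-learn illustration under assumed variable names X and y; imblearn's Pipeline re-samples during fitting but never during testing):

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# y is assumed to be binary-encoded (0 = No, 1 = Yes).
# SMOTE runs inside each training fold only; test folds stay untouched.
pipe = Pipeline([("smote", SMOTE(random_state=42)),
                 ("clf", DecisionTreeClassifier(random_state=42))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(pipe, X, y, cv=cv,
                        scoring=["accuracy", "f1", "roc_auc",
                                 "balanced_accuracy"])
```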
Next, the data are divided into training and testing sets inside the cross-validation operator. The input port (on the left) receives the training dataset and connects to the DT, as shown in the figure. After the learning phase, the trained model is sent to the testing phase, where the testing data are used to apply the model. Finally, the model is verified using the "Apply Model" operator, which is connected to the "Performance" operator to measure various characteristics of the classification model. Various parameters can be chosen for each classifier in RM. Hence, the authors experimented with various parameter combinations throughout implementation to evaluate the models' performance, and the parameters with the highest accuracy were chosen. Similar to Figure 3, the other classifiers with different sampling techniques adhere to the same process.
Meanwhile, the ensemble learning method is divided into three stages: (i) the initial stage, (ii) inside the cross-validation process, and (iii) inside the bagging/boosting process. Bagging and boosting (AdaBoost) were used as the ensemble learning methods to examine the differences in the performance metrics of each classifier.
Figure 4 and Figure 5 demonstrate the implementation of bagging and boosting in the ensemble learning method. As for bagging, enhancing the performance of the classification model is the key motivation behind choosing this ensemble method [31]. Bagging is a meta-algorithm renowned for its aggregation capabilities. Its working scenario is based on bootstrapping, which separates the original dataset into numerous training datasets known as bootstraps. These datasets are used to develop numerous models, which are eventually combined to produce a powerful learner. The sub-process of this operator, which can host various learner models, is known as a nested operator.
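A minimal code analogue of the bagging nested operator (sketched in Python with scikit-learn, which is an assumption of this illustration; the study configured the equivalent RapidMiner operator) looks like this:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Ten bootstrap replicates of the training data, one tree per replicate;
# predictions are aggregated by majority vote.
bagged_dt = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                              n_estimators=10, random_state=42)
# bagged_dt.fit(X_train, y_train); y_pred = bagged_dt.predict(X_test)
```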
Boosting is a popular ensemble strategy in machine learning that combines numerous models to obtain a robust model. It achieves this goal by training several learning models consecutively and then combining them based on the errors discovered in each learning model. Besides that, according to research by Nazemi et al., boosting is helpful in reducing bias and variance [39]. One boosting algorithm that can be used with other learning algorithms is AdaBoost, which stands for adaptive boosting [31]. The meta-algorithm used to implement AdaBoost in the RM tool can run the process by adding another algorithm as a sub-process. After running and training numerous models, it combines the weak learners to form a single strong learner, at the cost of extra calculation and running time. In this study, the classification model is trained using AdaBoost's ensemble method combined with each of the five algorithms as sub-processes. The primary goal of using AdaBoost is to compare decision-making models with and without boosted approaches in terms of performance and accuracy. The model's overall performance is examined in the results and discussion section. Similar to Figure 4 and Figure 5, the other classifiers with different sampling techniques follow the same process.
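An analogous sketch of the AdaBoost wrapper (again in Python/scikit-learn as an assumed stand-in for the RapidMiner operator) is:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Fifty weak learners trained in sequence; each round re-weights the
# training samples so later learners focus on previously misclassified ones.
boosted_dt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                n_estimators=50, random_state=42)
# boosted_dt.fit(X_train, y_train); y_pred = boosted_dt.predict(X_test)
```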
Step 5: Model Assessment and Comparison—In the final phase, it is necessary to evaluate the results and review the steps performed in detail [38]. The performance of each tested model was assessed using the confusion matrix, which includes the numbers of TP, FP, TN, and FN. The accuracy, precision, and recall measures can be calculated from these quantities. Accuracy measures the model's ability to identify true positives as positive and true negatives as negative. Precision is calculated by dividing the true positives by everything predicted as positive, whereas recall is calculated by dividing the true positives by everything that should have been predicted as positive. The formulas for the performance metrics used in our analysis are shown in Table 4.
The tools from which various accuracy measures are derived include the ROC chart and statistics such as accuracy, the F1-score, and the ROC index. Tékouabou, Alaoui, et al. (2022) [40] state that the F1-score balances precision and recall (also known as sensitivity), accounting for both minority and majority classes. Hence, it is a good indicator for choosing the best model in classification problems. Other useful metrics are balanced accuracy (BA) and the geometric mean (GM). BA is the average of the rates of correctly classifying positive and negative events. Unlike accuracy, BA is robust for evaluating classifiers on imbalanced datasets [41]. GM is also an effective indicator for binary classification problems with imbalanced data.
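As an illustration, all of the above metrics can be computed from a model's predictions as follows (a Python sketch with assumed variables y_true, y_pred, and y_score; the study obtained these values from RapidMiner's Performance operator):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score, roc_auc_score,
                             confusion_matrix)
from imblearn.metrics import geometric_mean_score

# y_true/y_pred are binary labels; y_score is the predicted probability
# of the positive class, used for the AUC.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),           # (TP+TN)/total
    "precision": precision_score(y_true, y_pred),          # TP/(TP+FP)
    "recall":    recall_score(y_true, y_pred),             # TP/(TP+FN)
    "F1":        f1_score(y_true, y_pred),
    "BA":        balanced_accuracy_score(y_true, y_pred),  # mean of TPR, TNR
    "GM":        geometric_mean_score(y_true, y_pred),     # sqrt(TPR * TNR)
    "AUC":       roc_auc_score(y_true, y_score),
}
```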
5. Conclusions
Based on the results obtained in Section 4, it can be concluded that the performances of all classifiers that underwent a sampling method are almost identical, with SMOTE performing slightly better in balanced accuracy, AUC index, and GM. Even though the original dataset with no sampling applied yielded the highest accuracy value, it showed the lowest values on the other performance measures. Hence, the model built on the imbalanced dataset proved to be highly accurate but misrepresented the minority class. Logistic Regression showed consistent performance with or without ensembling (bagging and boosting) for SMOTE and RUS 1:1. Even though the LR model with RUS 1:1 showed the highest value in the F1 measure compared to other models, LR + SMOTE performed better on most of the other criteria. Hence, it can be concluded that SMOTE is the best sampling method [29], and it is quite challenging to determine which model is best for predicting potential life insurance purchasers in these data. Without feature selection in the dataset, LR performed better, which is supported by the research of Kaushik et al. [24]. However, similar studies on classifying financial decisions show that nonlinear methods such as Neural Networks [24] and Random Forest [23] perform significantly better. Moreover, the analysis shows that the Decision Tree is the best performer according to ROC, while Naïve Bayes appears to be the best performer according to the balanced accuracy, F1-score, and GM comparisons. Hence, the decision will depend on which performance criteria a researcher wishes to focus on.
Applying the ensemble methods produced varied performance comparisons. The ensemble methods (bagging and boosting) significantly improved the models' performances; however, the improvement differs from one model to another. Based on the results discussed in Section 4.2, researchers should evaluate each model with the different ensemble methods to determine which classifier and ensemble method are appropriate for the dataset being studied. The findings of this study will be beneficial to society, since insurance protection is important to the sustainability and financial well-being of families. They may help insurance businesses select potential buyers more effectively through a better underwriting process. Additionally, this study sheds light on prediction using various sampling and ensemble approaches throughout a classification process that employs data mining techniques on an imbalanced dataset.
Despite the beneficial findings, this study was conducted under a limitation: it does not consider feature selection, which may affect the study's outcome. Feature selection may provide a more thorough comparison, as it retains only the most significant attributes in the study. Perez et al. [29] stated that SMOTE performs better after applying feature selection. However, since this study aims to compare different sampling and ensemble methods on an imbalanced dataset, feature selection has been excluded. Hence, we recommend that future research consider feature selection for a better understanding and comparison. This study proposes an approach to determine the usefulness of various artificial intelligence approaches using machine learning to predict life insurance ownership. This approach may also be used in countries with roughly similar life insurance penetration rates. Further improvements may be achieved by including more socioeconomic status parameters and economic factors.