Bi-Partitioned Feature-Weighted K-Means Clustering for Detecting Insurance Fraud Claim Patterns
Abstract
1. Introduction
2. Related Works
3. Methods
3.1. Classical K-Means
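As a baseline for everything that follows, classical K-means treats every feature as equally important in the Euclidean distance. A minimal R illustration using the built-in stats::kmeans() (run on the Iris data purely for concreteness):

```r
# Classical K-means baseline: every feature carries equal weight in the
# Euclidean distance. Shown on the built-in Iris data for concreteness.
set.seed(42)                               # K-means depends on random starts
X  <- scale(iris[, 1:4])                   # standardize the features
km <- kmeans(X, centers = 3, nstart = 25)  # 25 restarts, keep the best fit
table(Cluster = km$cluster, Species = iris$Species)
```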
3.2. Distinguishing BPW K-Means from Fuzzy Clustering: Principles, Performance, and Applications
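Unlike the hard assignments produced by K-means variants, fuzzy clustering gives each observation a degree of membership in every cluster. A brief illustration of that distinction using the off-the-shelf cmeans() from the e1071 package (shown only for contrast; it is not part of the proposed BPW method):

```r
# Fuzzy c-means for contrast: soft memberships instead of hard assignments.
# Off-the-shelf e1071::cmeans(), not part of the proposed BPW method.
library(e1071)
fcm <- cmeans(scale(iris[, 1:4]), centers = 3, m = 2)  # m is the fuzzifier
head(round(fcm$membership, 2))  # each row sums to 1 across the 3 clusters
```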
3.3. Bi-Partition Weighted K-Means
Algorithm 1. The bi-partition weighted K-means algorithm for clustering.
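The full pseudocode is given in Algorithm 1. As a rough guide to the idea only, the sketch below implements one plausible reading of the bi-partition weighting: the top-m ranked features (m being the bi-partition number) share total weight β and the remaining features share 1 − β, after which ordinary K-means runs on weight-scaled columns. The bpw_kmeans() helper and its weighting scheme are assumptions of this sketch, not the authors' exact algorithm:

```r
# Hypothetical sketch (NOT the authors' exact Algorithm 1): the top-m
# ranked features share total weight beta, the remaining p - m features
# share 1 - beta. Scaling column j by sqrt(w_j) makes ordinary kmeans()
# minimize the weighted within-cluster sum of squares.
bpw_kmeans <- function(X, k, beta, m, nstart = 25) {
  p <- ncol(X)
  stopifnot(m >= 1, m < p, beta > 0, beta < 1)
  w <- c(rep(beta / m, m), rep((1 - beta) / (p - m), p - m))  # sums to 1
  kmeans(sweep(scale(X), 2, sqrt(w), `*`), centers = k, nstart = nstart)
}
# Columns of X are assumed pre-ordered from most to least important.
set.seed(1)
fit <- bpw_kmeans(as.matrix(iris[, 1:4]), k = 3, beta = 0.9, m = 1)
```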
3.4. Clustering Performance Metrics
3.4.1. Rand Index (RI)
3.4.2. Adjusted Rand Index (ARI)
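Both indices can be computed directly from the contingency table of true labels versus cluster assignments. The following base-R function implements the standard pair-counting definitions (the mclust package's adjustedRandIndex() agrees with the ARI computed here):

```r
# Pair-counting RI and ARI from the contingency table of labels vs clusters
# (standard definitions; mclust::adjustedRandIndex() agrees with this ARI).
rand_indices <- function(labels, clusters) {
  tab    <- table(labels, clusters)
  n      <- sum(tab)
  s_ij   <- sum(choose(tab, 2))           # pairs together in both partitions
  s_a    <- sum(choose(rowSums(tab), 2))  # pairs together under the labels
  s_b    <- sum(choose(colSums(tab), 2))  # pairs together under the clusters
  total  <- choose(n, 2)
  ri     <- (total + 2 * s_ij - s_a - s_b) / total
  e_ij   <- s_a * s_b / total             # expected value under chance
  ari    <- (s_ij - e_ij) / ((s_a + s_b) / 2 - e_ij)
  c(RI = ri, ARI = ari)
}
# Example: rand_indices(iris$Species, km$cluster) with km from Section 3.1.
```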
3.4.3. Algorithms for Measuring and Validating Clustering Quality Metrics
Algorithm 2. Accuracy function used for measuring the accuracy of clusters.
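As a rough reconstruction of what an accuracy function for clustering must do (the paper's exact Algorithm 2 is not reproduced here): cluster ids are arbitrary, so accuracy is taken as the best agreement achievable over all assignments of clusters to classes:

```r
# Cluster ids are arbitrary, so accuracy is the best agreement over all
# assignments of clusters to classes (brute force; fine for small k).
cluster_accuracy <- function(labels, clusters) {
  tab <- table(clusters, labels)
  best <- function(rows, cols) {            # recursive assignment search
    if (length(rows) == 0 || length(cols) == 0) return(0)
    max(sapply(seq_along(cols), function(j)
      tab[rows[1], cols[j]] + best(rows[-1], cols[-j])))
  }
  best(seq_len(nrow(tab)), seq_len(ncol(tab))) / length(labels)
}
```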
Algorithm 3. Validation method for clustering performance metrics.
4. Exploratory Analysis of Insurance Fraud Claims: Insights and Feature Rankings
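For readers wishing to reproduce a feature ranking, one common choice is random-forest permutation importance; the snippet below is illustrative only (claims_X and claims_y are hypothetical stand-ins for the prepared insurance feature matrix and fraud labels, and the authors' actual ranking procedure may differ):

```r
# Illustrative ranking via random-forest permutation importance; claims_X
# and claims_y are hypothetical stand-ins for the prepared feature matrix
# and fraud labels, and the authors' ranking procedure may differ.
library(randomForest)
rf   <- randomForest(x = claims_X, y = claims_y, importance = TRUE)
rank <- order(importance(rf, type = 1), decreasing = TRUE)  # best first
claims_ranked <- claims_X[, rank]  # reorder columns before bi-partitioning
```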
5. Clustering Performance of BPW K-Means vs. Classical K-Means on the Insurance Fraud Claims Dataset: Ranked vs. Unranked Features
5.1. Empirical Analysis of Optimal β and Bi-Partition Number Pairing
Output 1: Clustering Performance of the BPW K-Means Algorithm on
the Insurance Fraud Claims Dataset---Ranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.9, Bi-Partition Number = 1
Fraudulent Non-Fraudulent
1 44 3
2 5 42
Overall Statistics
Accuracy: 0.9149
95% CI: (0.8392, 0.9625)
No Information Rate: 0.5213
P-Value [Acc > NIR]: <2e-16
Kappa: 0.8298
McNemar's Test P-Value: 0.7237
Sensitivity: 0.8980
Specificity: 0.9333
Pos Pred Value: 0.9362
Neg Pred Value: 0.8936
Precision: 0.9362
Recall: 0.8980
F1: 0.9167
Prevalence: 0.5213
Detection Rate: 0.4681
Detection Prevalence: 0.5000
Balanced Accuracy: 0.9156
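Output 1 (and the analogous outputs below) follow the format of R's caret::confusionMatrix(); the summary can be reproduced from any confusion matrix, as in this minimal example with vectors chosen to match Output 1's counts:

```r
# Vectors chosen to match the counts in Output 1 (44/3 and 5/42).
library(caret)
truth <- factor(c(rep("Fraudulent", 49), rep("Non-Fraudulent", 45)))
pred  <- factor(c(rep("Fraudulent", 44), rep("Non-Fraudulent", 5),
                  rep("Fraudulent", 3),  rep("Non-Fraudulent", 42)),
                levels = levels(truth))
confusionMatrix(data = pred, reference = truth)  # accuracy 0.9149, etc.
```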
5.2. Classical K-Means and BPW K-Means Clustering of Insurance Claims
5.3. Comparative Evaluation of Clustering Performance: BPW K-Means vs. Classical K-Means Across Different Cluster Groupings
5.3.1. Analysis of Clustering Performance with Bi-Partition Number Fixed at 1
5.3.2. Analysis of Clustering Performance with β Value Fixed at 0.9
6. Practical Implications and Applications of the BPW K-Means Algorithm
7. Application of Clustering Methods to Other Datasets
7.1. Clustering Performance of BPW K-Means vs. Classical K-Means on the Iris Dataset: Ranked vs. Unranked Features
Output 2: Clustering Performance of the BPW K-Means Algorithm on
the Iris Dataset---Ranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.9, Bi-Partition Number = 1
setosa versicolor virginica
1 50 0 0
2 0 48 2
3 0 4 46
Overall Statistics
Accuracy: 0.96
95% CI: (0.915, 0.9852)
No Information Rate: 0.3467
P-Value [Acc > NIR]: < 2.2e-16
Kappa: 0.94
McNemar's Test P-Value: NA
Statistics by Class:
Class:1 Class:2 Class:3
Sensitivity 1.0000 0.9231 0.9583
Specificity 1.0000 0.9796 0.9608
Pos Pred Value 1.0000 0.9600 0.9200
Neg Pred Value 1.0000 0.9600 0.9800
Precision 1.0000 0.9600 0.9200
Recall 1.0000 0.9231 0.9583
F1 1.0000 0.9412 0.9388
Prevalence 0.3333 0.3467 0.3200
Detection Rate 0.3333 0.3200 0.3067
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9513 0.9596
7.2. Clustering Performance of the BPW K-Means vs. Classical K-Means on the Sirtuin6 Dataset: Ranked vs. Unranked Features
- First row: In Figure 12a, the BPW K-means achieved an optimal accuracy of 79% with the pair (β = 0.8, bi-partition number = 1) for the unranked data type (left column). Similarly, in Figure 12f, the ranked data type (right column) achieved its highest accuracies with β values of 0.2 and 0.3 paired with the same bi-partition number 1; these pairs achieved a potential optimal accuracy of 81%.
- Second row: In Figure 12b, the unranked data type achieved an optimal pair with β = 0.1 and bi-partition number = 2, leading to a potential accuracy of 82%. Likewise, in Figure 12g, the ranked data type achieved optimal accuracies of 81% using β values of 0.2 and 0.3 paired with bi-partition number = 2.
- Fourth row: In Figure 12d, the BPW K-means achieved a potential optimal accuracy of 82% for the unranked data type with the pair (β = 0.1, bi-partition number = 4). For the ranked data type (Figure 12i), β values of 0.2, 0.3, and 0.4 paired with bi-partition number = 4 achieved a slightly lower accuracy of 81%.
- Fifth row: Figure 12e,j exhibit similar patterns, where the BPW K-means algorithm achieved potential optimal accuracies of 79% and 78%, respectively, using β = 0.1 and bi-partition number = 5 for the unranked and ranked data types (a schematic of this grid search over (β, bi-partition number) pairs follows the list).
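The readings above trace a grid search over (β, bi-partition number) pairs. A schematic of that search, reusing the hypothetical bpw_kmeans() and cluster_accuracy() sketches from Section 3 (X stands in for the importance-ordered feature matrix and y for the true class labels):

```r
# Schematic grid search over (beta, bi-partition number); bpw_kmeans() and
# cluster_accuracy() are the hypothetical sketches from Section 3, X is the
# importance-ordered feature matrix, and y holds the true class labels.
grid <- expand.grid(beta = seq(0.1, 0.9, by = 0.1), m = 1:5)
grid$accuracy <- apply(grid, 1, function(g) {
  fit <- bpw_kmeans(X, k = 2, beta = g["beta"], m = g["m"])
  cluster_accuracy(y, fit$cluster)
})
grid[which.max(grid$accuracy), ]  # best (beta, m) pair
```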
Output 3: Clustering Performance of the BPW K-Means Algorithm on
the Sirtuin6 Dataset---Ranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.1, Bi-Partition Number = 3
Low BFE High BFE
1 41 9
2 9 41
Overall Statistics
Accuracy: 0.82
95% CI: (0.7305, 0.8897)
No Information Rate: 0.5
P-Value [Acc > NIR]: 3.074e-11
Kappa: 0.64
McNemar's Test P-Value: 1
Sensitivity: 0.82
Specificity: 0.82
Pos Pred Value: 0.82
Neg Pred Value: 0.82
Precision: 0.82
Recall: 0.82
F1: 0.82
Prevalence: 0.50
Detection Rate: 0.41
Detection Prevalence: 0.50
Balanced Accuracy: 0.82
7.3. Clustering Performance of the BPW K-Means vs. Classical K-Means on the Wholesale Customers Dataset: Ranked vs. Unranked Features
- First row: From Figure 13a,f, the unranked dataset achieved a potential optimal accuracy of 85% with the pair (β = 0.6, bi-partition number = 1), while the ranked dataset achieved a potential accuracy of 82% with the pair (β = 0.9, bi-partition number = 1).
- Second row: In Figure 13b,g, the BPW K-means achieved an accuracy of 83% for both the unranked and ranked datasets, with the optimal pair (β = 0.7, bi-partition number = 2) for the unranked dataset and (β = 0.3, bi-partition number = 2) for the ranked dataset.
Output 4: Clustering Performance of the BPW K-Means Algorithm on
the Wholesale Customers Dataset---Ranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.4, Bi-Partition Number = 3
Horeca Retail
1 288 10
2 56 86
Overall Statistics
Accuracy: 0.85
95% CI: (0.8132, 0.8821)
No Information Rate: 0.7818
P-Value [Acc > NIR]: 0.0001987
Kappa: 0.6251
McNemar's Test P-Value: 3.04e-08
Sensitivity: 0.8372
Specificity: 0.8958
Pos Pred Value: 0.9664
Neg Pred Value: 0.6056
Precision: 0.9664
Recall: 0.8372
F1: 0.8972
Prevalence: 0.7818
Detection Rate: 0.6545
Detection Prevalence: 0.6773
Balanced Accuracy: 0.8665
8. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Output A1: Clustering Performance of the BPW K-Means Algorithm on
the Insurance Fraud Claims Dataset---Unranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.9, Bi-Partition Number = 1
Fraudulent Non-Fraudulent
1 44 3
2 5 42
Overall Statistics
Accuracy: 0.9149
95% CI: (0.8392, 0.9625)
No Information Rate: 0.5213
P-Value [Acc > NIR]: <2e-16
Kappa: 0.8298
McNemar's Test P-Value: 0.7237
Sensitivity: 0.8980
Specificity: 0.9333
Pos Pred Value: 0.9362
Neg Pred Value: 0.8936
Precision: 0.9362
Recall: 0.8980
F1: 0.9167
Prevalence: 0.5213
Detection Rate: 0.4681
Detection Prevalence: 0.5000
Balanced Accuracy: 0.9156
Output A2: Clustering Performance of the BPW K-Means Algorithm on
the Insurance Fraud Claims Dataset---Unranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.9, Bi-Partition Number = 2
Fraudulent Non-Fraudulent
1 44 3
2 4 43
Overall Statistics
Accuracy: 0.9255
95% CI: (0.8526, 0.9695)
No Information Rate: 0.5106
P-Value [Acc > NIR]: <2e-16
Kappa: 0.8511
McNemar's Test P-Value: 1
Sensitivity: 0.9167
Specificity: 0.9348
Pos Pred Value: 0.9362
Neg Pred Value: 0.9149
Precision: 0.9362
Recall: 0.9167
F1: 0.9263
Prevalence: 0.5106
Detection Rate: 0.4681
Detection Prevalence: 0.5000
Balanced Accuracy: 0.9257
Output A3: Clustering Performance of the BPW K-Means Algorithm on
the Insurance Fraud Claims Dataset---Unranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.9, Bi-Partition Number = 3
Fraudulent Non-Fraudulent
1 45 2
2 5 42
Overall Statistics
Accuracy: 0.9255
95% CI: (0.8526, 0.9695)
No Information Rate: 0.5319
P-Value [Acc > NIR]: <2e-16
Kappa: 0.8511
McNemar's Test P-Value: 0.4497
Sensitivity: 0.9000
Specificity: 0.9545
Pos Pred Value: 0.9574
Neg Pred Value: 0.8936
Precision: 0.9574
Recall: 0.9000
F1: 0.9278
Prevalence: 0.5319
Detection Rate: 0.4787
Detection Prevalence: 0.5000
Balanced Accuracy: 0.9273
Output A4: Clustering Performance of the Classical K-Means
Algorithm on the Insurance Fraud Claims Dataset---Ranked
Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = NA, Bi-Partition Number = NA
Fraudulent Non-Fraudulent
1 5 42
2 5 42
Overall Statistics
Accuracy: 0.5
95% CI: (0.3951, 0.6049)
No Information Rate: 0.8936
P-Value [Acc > NIR]: 1
Kappa: 0
McNemar's Test P-Value: 1.512e-07
Sensitivity: 0.50000
Specificity: 0.50000
Pos Pred Value: 0.10638
Neg Pred Value: 0.89362
Precision: 0.10638
Recall: 0.50000
F1: 0.17544
Prevalence: 0.10638
Detection Rate: 0.05319
Detection Prevalence: 0.50000
Balanced Accuracy: 0.50000
Output A5: Clustering Performance of the BPW K-Means Algorithm
on the Iris Dataset---Ranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.9, Bi-Partition Number = 2
setosa versicolor virginica
1 50 0 0
2 0 50 0
3 0 8 42
Overall Statistics
Accuracy: 0.9467
95% CI: (0.8976, 0.9767)
No Information Rate: 0.3867
P-Value [Acc > NIR]: < 2.2e-16
Kappa: 0.92
McNemar's Test P-Value: NA
Statistics by Class:
Class:1 Class:2 Class:3
Sensitivity 1.0000 0.8621 1.0000
Specificity 1.0000 1.0000 0.9259
Pos Pred Value 1.0000 1.0000 0.8400
Neg Pred Value 1.0000 0.9200 1.0000
Precision 1.0000 1.0000 0.8400
Recall 1.0000 0.8621 1.0000
F1 1.0000 0.9259 0.9130
Prevalence 0.3333 0.3867 0.2800
Detection Rate 0.3333 0.3333 0.2800
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9310 0.9630
Output A6: Clustering Performance of the BPW K-Means Algorithm
on the Iris Dataset---Ranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.9, Bi-Partition Number = 3
setosa versicolor virginica
1 50 0 0
2 0 48 2
3 0 14 36
Overall Statistics
Accuracy: 0.8933
95% CI: (0.8326, 0.9378)
No Information Rate: 0.4133
P-Value [Acc > NIR]: < 2.2e-16
Kappa: 0.84
McNemar's Test P-Value: NA
Statistics by Class:
Class:1 Class:2 Class:3
Sensitivity 1.0000 0.7742 0.9474
Specificity 1.0000 0.9773 0.8750
Pos Pred Value 1.0000 0.9600 0.7200
Neg Pred Value 1.0000 0.8600 0.9800
Precision 1.0000 0.9600 0.7200
Recall 1.0000 0.7742 0.9474
F1 1.0000 0.8571 0.8182
Prevalence 0.3333 0.4133 0.2533
Detection Rate 0.3333 0.3200 0.2400
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.8757 0.9112
Output A7: Clustering Performance of the Classical K-Means
Algorithm on the Iris Dataset---Ranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = NA, Bi-Partition Number = NA
setosa versicolor virginica
1 50 0 0
2 0 48 2
3 0 14 36
Overall Statistics
Accuracy: 0.8933
95% CI: (0.8326, 0.9378)
No Information Rate: 0.4133
P-Value [Acc > NIR]: < 2.2e-16
Kappa: 0.84
McNemar's Test P-Value: NA
Statistics by Class:
Class:1 Class:2 Class:3
Sensitivity 1.0000 0.7742 0.9474
Specificity 1.0000 0.9773 0.8750
Pos Pred Value 1.0000 0.9600 0.7200
Neg Pred Value 1.0000 0.8600 0.9800
Precision 1.0000 0.9600 0.7200
Recall 1.0000 0.7742 0.9474
F1 1.0000 0.8571 0.8182
Prevalence 0.3333 0.4133 0.2533
Detection Rate 0.3333 0.3200 0.2400
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.8757 0.9112
Output A8: Clustering Performance of the BPW K-Means Algorithm
on the Iris Dataset---Unranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.1, Bi-Partition Number = 1
setosa versicolor virginica
1 50 0 0
2 0 50 0
3 0 10 40
Overall Statistics
Accuracy: 0.9333
95% CI: (0.8808, 0.9676)
No Information Rate: 0.4
P-Value [Acc > NIR]: < 2.2e-16
Kappa: 0.9
McNemar's Test P-Value: NA
Statistics by Class:
Class:1 Class:2 Class:3
Sensitivity 1.0000 0.8333 1.0000
Specificity 1.0000 1.0000 0.9091
Pos Pred Value 1.0000 1.0000 0.8000
Neg Pred Value 1.0000 0.9000 1.0000
Precision 1.0000 1.0000 0.8000
Recall 1.0000 0.8333 1.0000
F1 1.0000 0.9091 0.8889
Prevalence 0.3333 0.4000 0.2667
Detection Rate 0.3333 0.3333 0.2667
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9167 0.9545
Output A9: Clustering Performance of the BPW K-Means Algorithm
on the Iris Dataset---Unranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.1, Bi-Partition Number = 2
setosa versicolor virginica
1 50 0 0
2 0 50 0
3 0 8 42
Overall Statistics
Accuracy: 0.9467
95% CI: (0.8976, 0.9767)
No Information Rate: 0.3867
P-Value [Acc > NIR]: < 2.2e-16
Kappa: 0.92
McNemar's Test P-Value: NA
Statistics by Class:
Class:1 Class:2 Class:3
Sensitivity 1.0000 0.8621 1.0000
Specificity 1.0000 1.0000 0.9259
Pos Pred Value 1.0000 1.0000 0.8400
Neg Pred Value 1.0000 0.9200 1.0000
Precision 1.0000 1.0000 0.8400
Recall 1.0000 0.8621 1.0000
F1 1.0000 0.9259 0.9130
Prevalence 0.3333 0.3867 0.2800
Detection Rate 0.3333 0.3333 0.2800
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9310 0.9630
Output A10: Clustering Performance of the BPW K-Means Algorithm
on the Iris Dataset---Unranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.1, Bi-Partition Number = 3
setosa versicolor virginica
1 50 0 0
2 0 48 2
3 0 4 46
Overall Statistics
Accuracy: 0.96
95% CI: (0.915, 0.9852)
No Information Rate: 0.3467
P-Value [Acc > NIR]: < 2.2e-16
Kappa: 0.94
McNemar's Test P-Value: NA
Statistics by Class:
Class:1 Class:2 Class:3
Sensitivity 1.0000 0.9231 0.9583
Specificity 1.0000 0.9796 0.9608
Pos Pred Value 1.0000 0.9600 0.9200
Neg Pred Value 1.0000 0.9600 0.9800
Precision 1.0000 0.9600 0.9200
Recall 1.0000 0.9231 0.9583
F1 1.0000 0.9412 0.9388
Prevalence 0.3333 0.3467 0.3200
Detection Rate 0.3333 0.3200 0.3067
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9513 0.9596
Output A11: Clustering Performance of the BPW K-Means Algorithm
on the Sirtuin6 Dataset---Unranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.1, Bi-Partition Number = 2
Low BFE High BFE
1 41 9
2 9 41
Overall Statistics
Accuracy: 0.82
95% CI: (0.7305, 0.8897)
No Information Rate: 0.5
P-Value [Acc > NIR]: 3.074e-11
Kappa: 0.64
McNemar's Test P-Value: 1
Sensitivity: 0.82
Specificity: 0.82
Pos Pred Value: 0.82
Neg Pred Value: 0.82
Precision: 0.82
Recall: 0.82
F1: 0.82
Prevalence: 0.50
Detection Rate: 0.41
Detection Prevalence: 0.50
Balanced Accuracy: 0.82
Output A12: Clustering Performance of the Classical K-Means
Algorithm on the Sirtuin6 Dataset---Ranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = NA, Bi-Partition Number = NA
Low BFE High BFE
1 44 6
2 18 32
Overall Statistics
Accuracy: 0.76
95% CI: (0.6643, 0.8398)
No Information Rate: 0.62
P-Value [Acc > NIR]: 0.002122
Kappa: 0.52
McNemar's Test P-Value: 0.024745
Sensitivity: 0.7097
Specificity: 0.8421
Pos Pred Value: 0.8800
Neg Pred Value: 0.6400
Precision: 0.8800
Recall: 0.7097
F1: 0.7857
Prevalence: 0.6200
Detection Rate: 0.4400
Detection Prevalence: 0.5000
Balanced Accuracy: 0.7759
Output A13: Clustering Performance of the BPW K-Means Algorithm on
the Wholesale Customers Dataset---Unranked Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = 0.6, Bi-Partition Number = 2
Horeca Retail
1 293 5
2 72 70
Overall Statistics
Accuracy: 0.825
95% CI: (0.7862, 0.8594)
No Information Rate: 0.8295
P-Value [Acc > NIR]: 0.6291
Kappa: 0.5433
McNemar's Test P-Value: 5.419e-14
Sensitivity: 0.8027
Specificity: 0.9333
Pos Pred Value: 0.9832
Neg Pred Value: 0.4930
Precision: 0.9832
Recall: 0.8027
F1: 0.8839
Prevalence: 0.8295
Detection Rate: 0.6659
Detection Prevalence: 0.6773
Balanced Accuracy: 0.8680
Output A14: Clustering Performance of the Classical K-Means
Algorithm on the Wholesale Customers Dataset---Ranked
Features
Confusion Matrix and Statistics
Parameter Values Used:
Beta = NA, Bi-Partition Number = NA
Horeca Retail
1 247 51
2 128 14
Overall Statistics
Accuracy: 0.5932
95% CI: (0.5456, 0.6395)
No Information Rate: 0.8523
P-Value [Acc > NIR]: 1
Kappa: -0.0845
McNemar's Test P-Value: 1.343e-08
Sensitivity: 0.65867
Specificity: 0.21538
Pos Pred Value: 0.82886
Neg Pred Value: 0.09859
Precision: 0.82886
Recall: 0.65867
F1: 0.73403
Prevalence: 0.85227
Detection Rate: 0.56136
Detection Prevalence: 0.67727
Balanced Accuracy: 0.43703
References
- McCaffery, K. Financial Services Regulatory Authority of Ontario. Automobile Insurance. 2023. Available online: https://insurance-portal.ca/article/large-number-of-ontario-drivers-believe-auto-insurance-fraud-is-prevalent/ (accessed on 10 November 2023).
- Lekha, K.C.; Prakasam, S. Data mining techniques in detecting and predicting cyber crimes in banking sector. In Proceedings of the 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), Chennai, India, 1–2 August 2017; pp. 1639–1643.
- Nassar, O.A.; Al Saiyd, N.A. The integrating between web usage mining and data mining techniques. In Proceedings of the 2013 5th International Conference on Computer Science and Information Technology, Amman, Jordan, 27–28 March 2013; pp. 243–247.
- Kowshalya, G.; Nandhini, M. Predicting fraudulent claims in automobile insurance. In Proceedings of the 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 20–21 April 2018; pp. 1338–1343.
- Patel, D.K.; Subudhi, S. Application of extreme learning machine in detecting auto insurance fraud. In Proceedings of the 2019 International Conference on Applied Machine Learning (ICAML), Bhubaneswar, India, 25–26 May 2019; pp. 78–81.
- Óskarsdóttir, M.; Ahmed, W.; Antonio, K.; Baesens, B.; Dendievel, R.; Donas, T.; Reynkens, T. Social network analytics for supervised fraud detection in insurance. Risk Anal. 2022, 42, 1872–1890.
- Bodaghi, A.; Teimourpour, B. Automobile insurance fraud detection using social network analysis. In Applications of Data Management and Analysis: Case Studies in Social Networks and Beyond; Springer: Berlin/Heidelberg, Germany, 2018; pp. 11–16.
- Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282.
- Kaggle Dataset. Vehicle Insurance Claim Fraud Detection. 2021. Available online: https://www.kaggle.com/datasets/aashishjhamtani/automobile-insurance (accessed on 17 May 2023).
- Kelly, M.; Longjohn, R.; Nottingham, K. The UCI Machine Learning Repository. 2023. Available online: https://archive.ics.uci.edu (accessed on 18 December 2024).
- Nian, K.; Zhang, H.; Tayal, A.; Coleman, T.; Li, Y. Auto insurance fraud detection using unsupervised spectral ranking for anomaly. J. Financ. Data Sci. 2016, 2, 58–75.
- Yang, J.; Chen, K.; Ding, K.; Na, C.; Wang, M. Auto insurance fraud detection with multimodal learning. Data Intell. 2023, 5, 388–412.
- Ming, R.; Abdelrahman, O.; Innab, N.; Ibrahim, M.H.K. Enhancing fraud detection in auto insurance and credit card transactions: A novel approach integrating CNNs and machine learning algorithms. PeerJ Comput. Sci. 2024, 10, e2088.
- Wongpanti, R.; Vittayakorn, S. Enhancing Auto Insurance Fraud Detection Using Convolutional Neural Networks. In Proceedings of the 2024 21st International Joint Conference on Computer Science and Software Engineering (JCSSE), Phuket, Thailand, 19–22 June 2024; pp. 294–301.
- Nti, I.K.; Adu, K.; Nimbe, P.; Nyarko-Boateng, O.; Adekoya, A.F.; Appiahene, P. Robust and resourceful automobile insurance fraud detection with multi-stacked LSTM network and adaptive synthetic oversampling. Int. J. Appl. Decis. Sci. 2024, 17, 230–249.
- Wei, S.; Lee, S. Financial anti-fraud based on dual-channel graph attention network. J. Theor. Appl. Electron. Commer. Res. 2024, 19, 297–314.
- Van Driel, H. Financial fraud, scandals, and regulation: A conceptual framework and literature review. In Business History; Taylor and Francis Group: Abingdon, UK, 2019.
- Schrijver, G.; Sarmah, D.K.; El-Hajj, M. Automobile Insurance Fraud Detection Using Data Mining: A Systematic Literature Review. In Intelligent Systems with Applications; Elsevier: Amsterdam, The Netherlands, 2024; p. 200340.
- Government of Ontario. Ontario Automobile Insurance Anti-Fraud Task Force: Groupe de Travail Antifraude de L'Assurance-Automobile de L'Ontario; Canadian Electronic Library: Canada, 2012. Available online: https://canadacommons.ca/artifacts/1201133/ontario-automobile-insurance-anti-fraud-task-force/1754253/ (accessed on 8 August 2024).
- Nobel, S.N.; Sultana, S.; Singha, S.P.; Chaki, S.; Mahi, M.J.N.; Jan, T.; Barros, A.; Whaiduzzaman, M. Unmasking Banking Fraud: Unleashing the Power of Machine Learning and Explainable AI (XAI) on Imbalanced Data. Information 2024, 15, 298.
- Urunkar, A.; Khot, A.; Bhat, R.; Mudegol, N. Fraud Detection and Analysis for Insurance Claim using Machine Learning. In Proceedings of the 2022 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), Phuket, Thailand, 19–22 June 2022; Volume 1, pp. 406–411.
- Soua, M.; Kachouri, R.; Akil, M. Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing. In Proceedings of the 2015 9th International Symposium on Image and Signal Processing and Analysis (ISPA), Zagreb, Croatia, 7–9 September 2015; pp. 210–215.
- Thiprungsri, S.; Vasarhelyi, M.A. Cluster Analysis for Anomaly Detection in Accounting Data: An Audit Approach. Int. J. Digit. Account. Res. 2011, 11, 69–84.
- Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. London Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
- Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417.
- Comon, P. Independent component analysis, a new concept? Signal Process. 1994, 36, 287–314.
- Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 2000, 13, 411–430.
- He, X.; Cai, D.; Niyogi, P. Laplacian score for feature selection. Adv. Neural Inf. Process. Syst. 2005, 18, 1–8.
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Kira, K.; Rendell, L.A. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, USA, 12–16 July 1992; pp. 129–134.
- Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, Catania, Italy, 6–8 April 1994; pp. 171–182.
- Robnik-Šikonja, M.; Kononenko, I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69.
- Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 100–108.
- Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
- MacQueen, J. Some methods for classification and analysis of multivariate observations. Berkeley Symp. Math. Statist. Prob. 1967, 1, 281–297.
- Wang, S.; Sun, Y.; Bao, Z. On the efficiency of k-means clustering: Evaluation, optimization, and algorithm selection. arXiv 2020, arXiv:2010.06654.
- D'Urso, P.; Massari, R. Fuzzy clustering of mixed data. Inf. Sci. 2019, 505, 513–534.
- Qian, Y.; Yao, S.; Wu, T.; Huang, Y.; Zeng, L. Improved Selective Deep-Learning-Based Clustering Ensemble. Appl. Sci. 2024, 14, 719.
- Gan, L.; Allen, G.I. Fast and interpretable consensus clustering via minipatch learning. PLoS Comput. Biol. 2022, 18, e1010577.
- Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850.
- Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
- Fisher, R.A. Iris. UCI Machine Learning Repository, 1936.
- Tardu, M.; Rahim, F. Sirtuin6 Small Molecules. UCI Machine Learning Repository, 2016.
- Cardoso, M. Wholesale Customers. UCI Machine Learning Repository, 2013. Available online: https://archive.ics.uci.edu/dataset/292/wholesale+customers (accessed on 1 January 2025).
Average accuracy of the BPW K-means algorithm, reported as Mean (SD), by β (rows) and data-column bi-partition number (columns); unranked features.

| β | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 0.1 | 0.50 (0.00) | 0.50 (0.00) | 0.50 (0.00) | 0.53 (0.01) | 0.52 (0.01) | 0.54 (0.02) |
| 0.2 | 0.50 (0.00) | 0.50 (0.00) | 0.50 (0.00) | 0.50 (0.00) | 0.52 (0.01) | 0.51 (0.00) |
| 0.3 | 0.50 (0.00) | 0.50 (0.00) | 0.50 (0.00) | 0.50 (0.00) | 0.52 (0.01) | 0.50 (0.00) |
| 0.4 | 0.50 (0.00) | 0.50 (0.00) | 0.50 (0.00) | 0.58 (0.05) | 0.58 (0.05) | 0.58 (0.05) |
| 0.5 | 0.58 (0.05) | 0.58 (0.05) | 0.57 (0.05) | 0.58 (0.05) | 0.58 (0.05) | 0.58 (0.05) |
| 0.6 | 0.63 (0.12) | 0.62 (0.10) | 0.61 (0.09) | 0.60 (0.08) | 0.57 (0.05) | 0.57 (0.05) |
| 0.7 | 0.77 (0.17) | 0.68 (0.16) | 0.60 (0.14) | 0.63 (0.09) | 0.58 (0.05) | 0.58 (0.05) |
| 0.8 | 0.83 (0.17) | 0.63 (0.28) | 0.59 (0.16) | 0.64 (0.13) | 0.62 (0.09) | 0.57 (0.06) |
| 0.9 | 0.89 (0.05) | 0.70 (0.29) | 0.60 (0.19) | 0.63 (0.14) | 0.64 (0.09) | 0.56 (0.06) |

Classical K-means result (Mean (SD)): …
Average accuracy of the BPW K-means algorithm, reported as Mean (SD), by β (rows) and data-column bi-partition number (columns); unranked features.

| β | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 0.1 | 0.50 (0.00) | 0.50 (0.00) | 0.60 (0.00) | 0.50 (0.00) | 0.50 (0.00) | 0.51 (0.00) |
| 0.2 | 0.50 (0.00) | 0.50 (0.00) | 0.61 (0.00) | 0.62 (0.00) | 0.50 (0.00) | 0.54 (0.00) |
| 0.3 | 0.50 (0.00) | 0.50 (0.00) | 0.59 (0.00) | 0.62 (0.00) | 0.62 (0.00) | 0.61 (0.00) |
| 0.4 | 0.50 (0.00) | 0.50 (0.00) | 0.61 (0.00) | 0.62 (0.00) | 0.62 (0.00) | 0.61 (0.00) |
| 0.5 | 0.61 (0.00) | 0.61 (0.00) | 0.61 (0.00) | 0.61 (0.00) | 0.61 (0.00) | 0.61 (0.00) |
| 0.6 | 0.78 (0.00) | 0.23 (0.00) | 0.61 (0.00) | 0.61 (0.00) | 0.61 (0.00) | 0.61 (0.00) |
| 0.7 | 0.85 (0.00) | 0.80 (0.00) | 0.77 (0.00) | 0.60 (0.00) | 0.50 (0.00) | 0.61 (0.00) |
| 0.8 | 0.00 (0.00) | 0.78 (0.00) | 0.76 (0.00) | 0.50 (0.00) | 0.61 (0.00) | … |
| 0.9 | 0.00 (0.00) | 0.78 (0.00) | 0.73 (0.00) | 0.50 (0.00) | 0.61 (0.00) | … |

Classical K-means result (Mean (SD)): …
Average Rand index (RI) and average adjusted Rand index (ARI) of the BPW K-means algorithm by β (rows) and data-column bi-partition number (columns); unranked features.

| β | RI (1) | RI (2) | RI (3) | RI (4) | RI (5) | RI (6) | ARI (1) | ARI (2) | ARI (3) | ARI (4) | ARI (5) | ARI (6) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.49 | 0.49 | 0.49 | 0.50 | 0.50 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 0.2 | 0.49 | 0.49 | 0.49 | 0.49 | 0.50 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 0.3 | 0.49 | 0.49 | 0.49 | 0.49 | 0.50 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 0.4 | 0.49 | 0.49 | 0.49 | 0.52 | 0.52 | 0.52 | 0.00 | 0.00 | 0.00 | 0.04 | 0.04 | 0.04 |
| 0.5 | 0.52 | 0.52 | 0.52 | 0.52 | 0.52 | 0.52 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 |
| 0.6 | 0.65 | 0.65 | 0.64 | 0.64 | 0.52 | 0.52 | 0.30 | 0.30 | 0.28 | 0.28 | 0.04 | 0.04 |
| 0.7 | 0.74 | 0.76 | 0.73 | 0.65 | 0.52 | 0.52 | 0.49 | 0.52 | 0.46 | 0.30 | 0.05 | 0.05 |
| 0.8 | 0.84 | 0.84 | 0.84 | 0.73 | 0.60 | 0.52 | 0.69 | 0.69 | 0.69 | 0.46 | 0.19 | 0.05 |
| 0.9 | 0.84 | 0.74 | 0.65 | 0.52 | … | … | 0.69 | 0.49 | 0.30 | 0.05 | … | … |

Classical K-means result: RI = 0.52, ARI = 0.05.
Average Rand index (RI) and average adjusted Rand index (ARI) of the BPW K-means algorithm by β (rows) and data-column bi-partition number (columns); unranked features.

| β | RI (1) | RI (2) | RI (3) | RI (4) | RI (5) | RI (6) | ARI (1) | ARI (2) | ARI (3) | ARI (4) | ARI (5) | ARI (6) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.49 | 0.49 | 0.50 | 0.50 | 0.49 | 0.49 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 |
| 0.2 | 0.49 | 0.49 | 0.50 | 0.50 | 0.50 | 0.50 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.00 |
| 0.3 | 0.49 | 0.49 | 0.50 | 0.50 | 0.50 | 0.51 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.02 |
| 0.4 | 0.49 | 0.49 | 0.51 | 0.51 | 0.51 | 0.51 | 0.00 | 0.00 | 0.02 | 0.03 | 0.03 | 0.02 |
| 0.5 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 |
| 0.6 | 0.56 | 0.55 | 0.51 | 0.51 | 0.51 | 0.51 | 0.12 | 0.11 | 0.03 | 0.03 | 0.03 | 0.03 |
| 0.7 | 0.69 | 0.60 | 0.58 | 0.51 | 0.49 | 0.51 | 0.37 | 0.21 | 0.17 | 0.02 | 0.00 | 0.03 |
| 0.8 | 0.79 | 0.70 | 0.60 | 0.55 | 0.50 | 0.51 | 0.58 | 0.40 | 0.19 | 0.11 | 0.01 | 0.03 |
| 0.9 | 0.59 | 0.56 | 0.49 | 0.51 | … | … | 0.19 | 0.12 | 0.00 | 0.03 | … | … |

Classical K-means result: RI = 0.49, ARI = 0.004.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Combert, F.K.; Xie, S.; Lawniczak, A.T. Bi-Partitioned Feature-Weighted K-Means Clustering for Detecting Insurance Fraud Claim Patterns. Mathematics 2025, 13, 434. https://doi.org/10.3390/math13030434