A Data-Driven Intelligent Supervision System for Generating High-Risk Organized Fraud Clues in Medical Insurance Funds
Abstract
1. Introduction
- Construction of a Legal Supervision Model Based on Real Medical Insurance Data: We construct a novel legal supervision model based on real-world medical insurance data, following the stages of clue generation, group detection, and anomaly judgment. Through systematic data processing and analysis, the model significantly enhances the efficiency and accuracy of medical insurance fraud detection.
- Multi-dimensional Group Aggregation Analysis: To address the group aggregation patterns typical of drug resale fraud, the model conducts a comprehensive analysis across temporal, spatial, and drug similarity dimensions. By combining group-level aggregation features with individual-level outliers, it achieves more accurate identification of potential fraud risks, thereby ensuring the security of medical insurance funds.
- Adaptive Risk Stratification Based on Clustering Methods: In contrast to traditional approaches that rely on expert-defined rules to determine anomalies, the proposed model utilizes clustering algorithms to automatically generate decision thresholds. This data-driven strategy eliminates subjective biases introduced by expert-defined rules and improves the model’s adaptability to diverse data distributions.
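The idea of deriving a decision threshold from the data rather than from an expert rule can be illustrated with a minimal sketch. It is not the paper's method: a one-dimensional KMeans with K = 2 stands in for the clustering step, and the function name, the deterministic initialization, and the sample scores are all assumptions made for illustration.

```python
from statistics import mean


def two_cluster_threshold(scores, max_iter=100):
    """Split 1-D scores into a low and a high cluster (KMeans, K=2) and
    return the midpoint between the two final centers as the cutoff."""
    if len(set(scores)) < 2:
        raise ValueError("need at least two distinct scores")
    low, high = min(scores), max(scores)  # deterministic initial centers
    for _ in range(max_iter):
        cut = (low + high) / 2
        # In one dimension, nearest-center assignment is a split at the midpoint.
        low_group = [s for s in scores if s <= cut]
        high_group = [s for s in scores if s > cut]
        new_low, new_high = mean(low_group), mean(high_group)
        if (new_low, new_high) == (low, high):
            break  # centers stopped moving
        low, high = new_low, new_high
    return (low + high) / 2


# Synthetic risk scores, invented for this example only.
scores = [0.10, 0.12, 0.15, 0.11, 0.90, 0.95, 0.88]
threshold = two_cluster_threshold(scores)
high_risk = [s for s in scores if s > threshold]
```

The cutoff moves with the score distribution, so no fixed value has to be chosen in advance; whether two clusters are appropriate is itself a modeling assumption.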
2. Related Work
3. Methods
3.1. Multi-Dimensional Clue Generation
3.2. Spatio-Temporal Group Anomaly Analysis
3.3. Adaptive Risk Stratification Assessment
Algorithm 1. Risk Stratification Assessment Model
4. Results
4.1. Experimental Details
4.2. Experimental Results
- KMeans assigns proximate samples to the same subset, minimizing intra-subset dissimilarity while maximizing inter-subset dissimilarity. It starts by randomly selecting K initial cluster centers, calculates the distance from each sample point to these centers, and assigns each point to the nearest center. Each center is then recomputed as the mean of its assigned points, and the two steps repeat until the assignments stabilize.
- HDBSCAN [28] is based on density-based clustering concepts, defining clusters as high-density regions separated by low-density areas. It uses core distance and mutual reachability distance to describe data point connectivity, avoiding the need to specify the number of clusters in advance. It constructs a density contour tree by analyzing the hierarchy of merging clusters as the density threshold decreases, and simplifies this tree using a minimum cluster size to obtain the final clustering result.
- XGBoost [31] is built on the gradient boosting framework: it constructs an additive ensemble of regression trees that minimizes prediction loss by iteratively fitting residual errors. It uses second-order gradient information for node splitting, introduces regularization terms to prevent overfitting, and supports parallel computation for efficiency. The algorithm starts with an initial constant prediction, and each new tree aims to correct the errors of the previous ensemble. The parameter values we use and their adjustment ranges are shown in Table 3.
Parameter Name | Value | Range |
---|---|---|
max_depth | 5 | 3–10 |
learning_rate | 0.01 | 0.01–0.2 |
subsample | 0.8 | 0.5–1.0 |
colsample_bytree | 0.8 | 0.5–1.0 |
n_estimators | 50 | 50–200 |
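The configuration in Table 3 can be written down as a small sketch. The keyword names follow the scikit-learn style XGBoost wrapper and are an assumption about how the table's labels map onto the library; the helper `check_within_ranges` is hypothetical and only verifies that each chosen value lies inside its stated adjustment range.

```python
# Values and ranges transcribed from Table 3. The keyword spellings are an
# assumption (scikit-learn style XGBoost wrapper), not taken from the paper.
XGB_PARAMS = {
    "max_depth": 5,
    "learning_rate": 0.01,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "n_estimators": 50,
}

XGB_SEARCH_RANGES = {
    "max_depth": (3, 10),
    "learning_rate": (0.01, 0.2),
    "subsample": (0.5, 1.0),
    "colsample_bytree": (0.5, 1.0),
    "n_estimators": (50, 200),
}


def check_within_ranges(params, ranges):
    """Return the names of parameters whose value falls outside its range."""
    return [
        name
        for name, value in params.items()
        if not ranges[name][0] <= value <= ranges[name][1]
    ]


out_of_range = check_within_ranges(XGB_PARAMS, XGB_SEARCH_RANGES)

# If the xgboost package is installed, the dictionary can be passed on as:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**XGB_PARAMS)
```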
4.3. Ablation Studies
- Without multi-dimension: To validate the effectiveness of our proposed multi-dimensional clue generation module, we designed corresponding ablation studies. Specifically, we conducted experiments using only individual dimensions (frequency, cost, and behavioral dimensions), as well as their pairwise combinations for clue generation, thereby demonstrating the necessity of each rule dimension.
- Without GA: To assess the effectiveness of the Group Anomaly analysis proposed in Section 3.2, we apply anomaly analysis only to single cards during multi-dimensional clue generation. Additionally, to demonstrate the necessity of combining the temporal and spatial dimensions in our group anomaly detection approach, we designed ablation experiments that use the spatial dimension or the temporal dimension in isolation.
- Without indicator aggregation: To validate the effectiveness of our proposed indicator aggregation module, we conducted an ablation study that removes the EWM-TOPSIS component, so that the statistically derived anomaly scores are fed directly into the subsequent FLASC algorithm for threshold generation.
- Without adaptive threshold: To demonstrate the necessity of the adaptive threshold generation module, we set the threshold to a fixed value corresponding to the top 10% of statistical scores.
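For readers unfamiliar with the component removed in the indicator aggregation ablation, the following is a minimal sketch of a textbook EWM-TOPSIS aggregation, not necessarily the variant used in the paper. It assumes every indicator is already scaled to [0, 1] with larger values meaning more anomalous, that there are at least two samples, and that each column has a positive sum and at least one column is non-uniform; the function names and the sample matrix are hypothetical.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method over the columns of a row-major score matrix."""
    n = len(matrix)
    k = 1.0 / math.log(n)  # scales each column entropy into [0, 1]
    diversity = []
    for column in zip(*matrix):
        total = sum(column)
        # Terms with a zero share contribute nothing (p * ln p -> 0).
        entropy = -k * sum(
            (x / total) * math.log(x / total) for x in column if x > 0
        )
        diversity.append(1.0 - entropy)
    norm = sum(diversity)
    return [d / norm for d in diversity]


def topsis_scores(matrix, weights):
    """Relative closeness of each row to the ideal (column-wise max) point."""
    weighted = [[w * x for w, x in zip(weights, row)] for row in matrix]
    columns = list(zip(*weighted))
    best = [max(c) for c in columns]
    worst = [min(c) for c in columns]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores


# Synthetic indicator matrix: rows are cards, columns are anomaly indicators.
indicators = [
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.3],
    [0.1, 0.2, 0.2],
    [0.3, 0.1, 0.1],
]
weights = entropy_weights(indicators)
risk = topsis_scores(indicators, weights)
```

Columns whose values are spread more unevenly across cards receive larger weights, and a row that attains the maximum in every column gets a closeness of 1.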
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Najar, A.V.; Alizamani, L.; Zarqi, M.; Hooshmand, E. A global scoping review on the patterns of medical fraud and abuse: Integrating data-driven detection, prevention, and legal responses. Arch. Public Health 2025, 83, 43.
2. Wang, Z.; Chen, X.; Wu, Y.; Jiang, L.; Lin, S.; Qiu, G. A robust and interpretable ensemble machine learning model for predicting healthcare insurance fraud. Sci. Rep. 2025, 15, 218.
3. Safitri, A.; Nurcihikita, T. The analysis of the implementation of the national health insurance fraud prevention program. J. Health Manag. Adm. Public Health Policies HealthMAPs 2024, 2, 52–63.
4. Hamid, Z.; Khalique, F.; Mahmood, S.; Daud, A.; Bukhari, A.; Alshemaimri, B. Healthcare insurance fraud detection using data mining. BMC Med. Inform. Decis. Mak. 2024, 24, 112.
5. Thornton, D.; Brinkhuis, M.; Amrit, C.; Aly, R. Categorizing and describing the types of fraud in healthcare. Procedia Comput. Sci. 2015, 64, 713–720.
6. Peng, J.; Li, Q.; Li, H.; Liu, L.; Yan, Z.; Zhang, S. Fraud Detection of Medical Insurance Employing Outlier Analysis. In Proceedings of the 2018 IEEE 22nd International Conference on Computer Supported Cooperative Work in Design (CSCWD), Nanjing, China, 9–11 May 2018; pp. 341–346.
7. Mao, Y.; Li, Y.; Xu, B.; Han, J. XGAN: A Medical Insurance fraud Detector based on GAN with XGBoost. J. Inf. Hiding Multim. Signal Process. 2024, 15, 36–52.
8. Zhang, R.; Cheng, D.; Yang, J.; Ouyang, Y.; Wu, X.; Zheng, Y.; Jiang, C. Pre-trained online contrastive learning for insurance fraud detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2024; Volume 38, pp. 22511–22519.
9. Alam, M.S.; Rai, P.; Tiwari, R.K. Machine Learning for Healthcare Fraud Detection: A Comprehensive Review Literature. In Leveraging Futuristic Machine Learning and Next-Generational Security for e-Governance; IGI Global Scientific Publishing: Hershey, PA, USA, 2025; pp. 229–254.
10. Wang, J.; Guo, Y.; Wen, X.; Wang, Z.; Li, Z.; Tang, M. Improving graph-based label propagation algorithm with group partition for fraud detection. Appl. Intell. 2020, 50, 3291–3300.
11. Ma, J.; Zhang, D.; Wang, Y.; Zhang, Y.; Pozdnoukhov, A. GraphRAD: A graph-based risky account detection system. In Proceedings of the ACM SIGKDD Conference, London, UK, 19–23 August 2018; Volume 9.
12. Tan, X.; Yang, J.; Zhao, Z.; Xiao, J.; Li, C. Improving Graph Convolutional Network with Learnable Edge Weights and Edge-Node Co-Embedding for Graph Anomaly Detection. Sensors 2024, 24, 2591.
13. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11.
14. Arockiam, J.M.; Pushpanathan, A.C.S. MapReduce-iterative support vector machine classifier: Novel fraud detection systems in healthcare insurance industry. Int. J. Electr. Comput. Eng. IJECE 2023, 13, 756.
15. Kumaraswamy, N.; Ekin, T.; Park, C.; Markey, M.K.; Barner, J.C.; Rascati, K. Using a Bayesian Belief Network to detect healthcare fraud. Expert Syst. Appl. 2024, 238, 122241.
16. Nalluri, V.; Chang, J.R.; Chen, L.S.; Chen, J.C. Building prediction models and discovering important factors of health insurance fraud using machine learning methods. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 9607–9619.
17. Van Capelleveen, G.; Poel, M.; Mueller, R.M.; Thornton, D.; van Hillegersberg, J. Outlier detection in healthcare fraud: A case study in the Medicaid dental domain. Int. J. Account. Inf. Syst. 2016, 21, 18–31.
18. De Meulemeester, H.; De Smet, F.; van Dorst, J.; Derroitte, E.; De Moor, B. Explainable unsupervised anomaly detection for healthcare insurance data. BMC Med. Inform. Decis. Mak. 2025, 25, 14.
19. Xu, H.; Pang, G.; Wang, Y.; Wang, Y. Deep Isolation Forest for Anomaly Detection. IEEE Trans. Knowl. Data Eng. 2023, 35, 12591–12604.
20. Islam Prova, N.N. Healthcare Fraud Detection Using Machine Learning. In Proceedings of the 2024 Second International Conference on Intelligent Cyber Physical Systems and Internet of Things (ICoICI), Coimbatore, India, 28–30 August 2024; pp. 1119–1123.
21. Mohammed, M.A.; Boujelben, M.; Abid, M. A Novel Approach for Fraud Detection in Blockchain-Based Healthcare Networks Using Machine Learning. Future Internet 2023, 15, 250.
22. Xiao, F.; Li, H.X.; Wang, X.K.; Wang, J.Q.; Chen, S.X. Predictive analysis for healthcare fraud detection: Integration of probabilistic model and interpretable machine learning. Inf. Sci. 2025, 719, 122499.
23. Sinaga, K.P.; Yang, M.S. Unsupervised K-means clustering algorithm. IEEE Access 2020, 8, 80716–80727.
24. Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2023, 622, 178–210.
25. Singh, H.V.; Girdhar, A.; Dahiya, S. A Literature survey based on DBSCAN algorithms. In Proceedings of the 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 25–27 May 2022; pp. 751–758.
26. Tang, C.; Wang, H.; Wang, Z.; Zeng, X.; Yan, H.; Xiao, Y. An improved OPTICS clustering algorithm for discovering clusters with uneven densities. Intell. Data Anal. 2021, 25, 1453–1471.
27. Kanagala, H.K.; Krishnaiah, V.J.R. A comparative study of K-Means, DBSCAN and OPTICS. In Proceedings of the 2016 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 7–9 January 2016; pp. 1–6.
28. Campello, R.J.G.B.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Trans. Knowl. Discov. Data 2015, 10, 5.
29. Stewart, G.; Al-Khassaweneh, M. An implementation of the HDBSCAN* clustering algorithm. Appl. Sci. 2022, 12, 2405.
30. Bot, D.M.; Peeters, J.; Liesenborgs, J.; Aerts, J. FLASC: A flare-sensitive clustering algorithm. PeerJ Comput. Sci. 2025, 11, e2792.
31. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD ’16: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
32. Parikh, D.; Radadia, S.; Eranna, R.K. Privacy-Preserving Machine Learning Techniques, Challenges and Research Directions. Int. Res. J. Eng. Technol. 2024, 11, 499.
33. Miller, S. Machine learning, ethics and law. Australas. J. Inf. Syst. 2019, 23, 1–13.
34. Galiana, L.I.; Gudino, L.C.; González, P.M. Ethics and artificial intelligence. Rev. Clín. Esp. Engl. Ed. 2024, 224, 178–186.
35. Chen, Y.; Ni, T.; Xu, W.; Gu, T. SwipePass: Acoustic-based Second-factor User Authentication for Smartphones. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2022, 6, 106.
36. Kong, J.; Song, X.; Huai, S.; Xu, B.; Luo, J.; He, Y. Do Not DeepFake Me: Privacy-Preserving Neural 3D Head Reconstruction Without Sensitive Images. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39, pp. 4383–4391.
37. Duan, D.; Sun, Z.; Ni, T.; Li, S.; Jia, X.; Xu, W.; Li, T. F2Key: Dynamically Converting Your Face into a Private Key Based on COTS Headphones for Reliable Voice Interaction. In MOBISYS ’24: Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services, Tokyo, Japan, 3–7 June 2024; Association for Computing Machinery: New York, NY, USA, 2024.
Dimension | Rule |
---|---|
Medical Frequency | Monthly OP count ≥15 |
Medical Frequency | ≥4 daily consultations for ≥3 days |
Medical Frequency | Annual OP count >100 |
Medical Cost | Monthly OP expenses ≥5000 RMB |
Medical Cost | Annual OP + EM expenses ≥25,000 RMB |
Medical Cost | Annual total insurance >30,000 RMB, drug >80%, inspection <10% |
Medical Behavior | Same drug at ≥3 institutions in 1 week |
Medical Behavior | >10 drug types at multiple institutions in 1 week |
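The rule table above translates naturally into threshold checks. The sketch below is one possible reading, not the authors' implementation: every field name in `CardSummary` is hypothetical, the rules within a dimension are assumed to combine with a logical OR, the "≥3 days" condition is read as a count of days that need not be consecutive, and the drug and inspection percentages are read as shares of the annual insurance total.

```python
from dataclasses import dataclass


@dataclass
class CardSummary:
    """Per-card yearly aggregates; all field names are hypothetical."""
    max_monthly_op_visits: int
    days_with_4plus_visits: int
    annual_op_visits: int
    max_monthly_op_expense: float            # RMB
    annual_op_em_expense: float              # RMB
    annual_insurance_total: float            # RMB
    drug_share: float                        # fraction of annual total
    inspection_share: float                  # fraction of annual total
    max_institutions_same_drug_week: int
    max_drug_types_multi_institution_week: int


def triggered_dimensions(c: CardSummary) -> set:
    """Return the rule dimensions in which the card meets at least one rule."""
    hits = set()
    if (c.max_monthly_op_visits >= 15
            or c.days_with_4plus_visits >= 3
            or c.annual_op_visits > 100):
        hits.add("Medical Frequency")
    if (c.max_monthly_op_expense >= 5000
            or c.annual_op_em_expense >= 25000
            or (c.annual_insurance_total > 30000
                and c.drug_share > 0.80
                and c.inspection_share < 0.10)):
        hits.add("Medical Cost")
    if (c.max_institutions_same_drug_week >= 3
            or c.max_drug_types_multi_institution_week > 10):
        hits.add("Medical Behavior")
    return hits


# Synthetic card: frequent visits and same-drug purchases, ordinary costs.
card = CardSummary(16, 0, 60, 1200.0, 9000.0, 12000.0, 0.55, 0.20, 3, 4)
clues = triggered_dimensions(card)
```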
Field Name | Anonymization Strategy |
---|---|
Patient/Physician name | Retain only the last name |
ID number | Retain the first 6 digits and digits 7 to 10 |
Home address | Remove |
Medical Insurance Card Number | Map to a random string of characters |
Institution Code | Map to a random string of characters |
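The anonymization strategies in the table above could be applied to a single record roughly as follows. This is a sketch under stated assumptions: the record layout and key names are hypothetical, the last name is assumed to be stored separately from the given name, only the patient name is handled for brevity, and the `Pseudonymizer` helper is an illustrative way to keep the random mapping stable so that the same card or institution always receives the same token.

```python
import secrets


class Pseudonymizer:
    """Maps each distinct value to a stable random token (hypothetical helper)."""

    def __init__(self):
        self._tokens = {}

    def token(self, value):
        if value not in self._tokens:
            self._tokens[value] = secrets.token_hex(8)
        return self._tokens[value]


def anonymize(record, cards, institutions):
    """Apply the table's strategies; fields not copied here are dropped."""
    return {
        # Keep the last name only; the given name is never copied.
        "patient_last_name": record["patient_last_name"],
        # First 6 digits plus digits 7 to 10 are the first 10 characters.
        "id_prefix": record["id_number"][:10],
        "card_token": cards.token(record["card_number"]),
        "institution_token": institutions.token(record["institution_code"]),
        # The home address is removed by omission.
    }


cards, institutions = Pseudonymizer(), Pseudonymizer()
# Entirely synthetic record, invented for this example.
record = {
    "patient_last_name": "Zhang",
    "patient_given_name": "Example",
    "id_number": "000000199001010000",
    "home_address": "1 Example Road",
    "card_number": "CARD-0001",
    "institution_code": "INST-42",
}
safe = anonymize(record, cards, institutions)
```

Because the mapping is held in memory, it would have to be persisted securely if records processed at different times need to be linked; that storage question is outside this sketch.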
Method | Precision | Recall | Accuracy | F1 | TP | FP | TN | FN |
---|---|---|---|---|---|---|---|---|
Our Method | 0.89 | 0.42 | 0.87 | 0.57 | 748 | 92 | 7043 | 1034 |
HDBSCAN | 0.73 | 0.34 | 0.84 | 0.47 | 611 | 226 | 6909 | 1171 |
KMeans | 0.73 | 0.38 | 0.85 | 0.50 | 674 | 251 | 6884 | 1108 |
XGBoost | 0.82 | 0.41 | 0.86 | 0.55 | 730 | 160 | 6975 | 1052 |
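The four metric columns in the table above follow from the confusion counts by the standard definitions, which the short sketch below reproduces for the "Our Method" row. The function name is an assumption; the counts are taken directly from the table.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, accuracy, f1


# "Our Method" row: TP = 748, FP = 92, TN = 7043, FN = 1034.
p, r, a, f = confusion_metrics(748, 92, 7043, 1034)
print(round(p, 2), round(r, 2), round(a, 2), round(f, 2))  # 0.89 0.42 0.87 0.57
```

The same check applied to the baseline rows reproduces their reported values after rounding to two decimals.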
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
He, Q.; Ding, Q.; Zheng, C.; Pan, L.; Liu, N.; Li, W. A Data-Driven Intelligent Supervision System for Generating High-Risk Organized Fraud Clues in Medical Insurance Funds. Electronics 2025, 14, 3268. https://doi.org/10.3390/electronics14163268