Methodology for Detecting Suspicious Claims in Health Insurance Using Supervised Machine Learning
Abstract
1. Introduction
2. Materials and Methods
2.1. Phase 1—Identifying Signs of Fraud
2.2. Phase 2—Identification of Available Manifestations and Factors
2.3. Phase 3—Data Preprocessing and Balancing
3. Results
3.1. Phase 4—Model Development, Training, and Evaluation
3.1.1. Confusion Matrix
3.1.2. Loss Function
3.1.3. Results of the Metrics
4. Discussion
4.1. About the Methodology
4.2. Regarding the Case Study: SIS
4.3. Limitations
5. Conclusions
Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| HIF | Health insurance fraud |
| SML | Supervised machine learning |
| PDHIF | Phases for Detecting Fraud in Health Insurance |
| CRISP-DM | Cross-industry standard process for data mining |
| ICD-10 | International Classification of Diseases, 10th Revision |
| CPT | Current procedural terminology |
| SMOTE | Synthetic minority oversampling technique |
| FUA | Single care form |
| PSA | Automatic supervision |
| SME | Electronic medical supervision |
| PCPP | Post-appointment onsite control |
References
- Joudaki, H.; Rashidian, A.; Minaei-Bidgoli, B.; Mahmoodi, M.; Geraili, B.; Nasiri, M.; Arab, M. Improving fraud and abuse detection in general physician claims: A data mining study. Int. J. Health Policy Manag. 2016, 5, 165–172.
- U.S. Department of Justice Justice Manual. 976. Health Care Fraud—Generally. United States Department of Justice. Available online: https://www.justice.gov/archives/jm/criminal-resource-manual-976-health-care-fraud-generally (accessed on 6 April 2025).
- Shrank, W.H.; Rogstad, T.L.; Parekh, N. Waste in the US Health Care System: Estimated Costs and Potential for Savings. JAMA 2019, 322, 1501.
- Kose, I.; Gokturk, M.; Kilic, K. An interactive machine-learning-based electronic fraud and abuse detection system in healthcare insurance. Appl. Soft Comput. J. 2015, 36, 283–299.
- Kirlidog, M.; Asuk, C. A fraud detection approach with data mining in health insurance. Procedia Soc. Behav. Sci. 2012, 62, 989–994.
- Shin, H.; Park, H.; Lee, J.; Jhee, W.C. A Scoring Model to Detect Abusive Billing Patterns in Health Insurance Claims. Expert Syst. Appl. 2012, 39, 7441–7450.
- Ahmed, M.; Ahamad, M.; Jaiswal, T. Augmenting Security and Accountability Within the eHealth Exchange. IBM J. Res. Dev. 2014, 58, 8.
- Phua, C.; Alahakoon, D.; Lee, V. Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explor. Newsl. 2004, 6, 50–59.
- Travaille, P.; Müller, R.M.; Thornton, D.; Hillegersberg, J. Electronic fraud detection in the US medicaid healthcare program: Lessons learned from other industries. In Proceedings of the 17th AMCIS 2011, Detroit, MI, USA, 4–8 August 2011; Available online: http://doc.utwente.nl/78000/ (accessed on 15 October 2016).
- Villegas-Ortega, J.; Bellido-Boza, L.; Mauricio, D. Fourteen years of manifestations and factors of health insurance fraud, 2006–2020: A scoping review. Health Justice 2021, 9, 26.
- Shimaoka, A.M.; Ferreira, R.C.; Goldman, A. The evolution of CRISP-DM for data science: Methods, processes and frameworks. SBC Rev. Comput. Sci. 2024, 4, 28–43.
- Schröer, C.; Kruse, F.; Gómez, J.M. A systematic literature review on applying CRISP-DM process model. Procedia Comput. Sci. 2021, 181, 526–534.
- Ortega, P.A.; Figueroa, C.J.; Ruz, G.A. A Medical Claim Fraud/Abuse Detection System based on Data Mining: A Case Study in Chile. DMIN 2006, 6, 26–29.
- Liou, F.-M.; Tang, Y.-C.; Chen, J.-Y. Detecting hospital fraud and claim abuse through diabetic outpatient services. Health Care Manag. Sci. 2008, 11, 353–358.
- Mailloux, A.T.; Cummings, S.W.; Mugdh, M. A decision support tool for identifying abuse of controlled substances by ForwardHealth Medicaid members. J. Hosp. Mark. Public Relat. 2010, 20, 34–55.
- Francis, C.; Pepper, N.; Strong, H. Using support vector machines to detect medical fraud and abuse. In Proceedings of the 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Boston, MA, USA, 30 August–3 September 2011; pp. 8291–8294.
- Bauder, R.A.; Khoshgoftaar, T.M.; Hasanin, T. Data Sampling Approaches with Severely Imbalanced Big Data for Medicare Fraud Detection. In Proceedings of the 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), Volos, Greece, 5–7 November 2018; IEEE: Volos, Greece, 2018; pp. 137–142.
- Johnson, J.M.; Khoshgoftaar, T.M. Data-Centric AI for Healthcare Fraud Detection. SN Comput. Sci. 2023, 4, 389.
- Hancock, J.T.; Bauder, R.A.; Wang, H.; Khoshgoftaar, T.M. Explainable machine learning models for Medicare fraud detection. J. Big Data 2023, 10, 154.
- Nabrawi, E.; Alanazi, A. Fraud Detection in Healthcare Insurance Claims Using Machine Learning. Risks 2023, 11, 160.
- Prova, N. Healthcare Fraud Detection Using Machine Learning 2024. In Proceedings of the 2024 Second International Conference on Intelligent Cyber Physical Systems and Internet of Things (ICoICI), Coimbatore, India, 20–30 August 2024.
- Bounab, R.; Zarour, K.; Guelib, B.; Khlifa, N. Enhancing Medicare Fraud Detection Through Machine Learning: Addressing Class Imbalance With SMOTE-ENN. IEEE Access 2024, 12, 54382–54396.
- Mao, Y.; Li, Y.; Xu, B.; Han, J. XGAN: A Medical Insurance Fraud Detector based on GAN with XGBoost. J. Inf. Hiding Multim. Signal Process 2024, 15, 36–52.
- Cressey, D.R. Other People’s Money; A Study of the Social Psychology of Embezzlement; The Free Press: Los Angeles, CA, USA, 1953.
- Wolfe, D.T.; Hermanson, D.R. The fraud diamond: Considering the four elements of fraud. CPA J. 2004, 74, 38–42.
- Kranacher, M.-J.; Riley, R. Forensic Accounting and Fraud Examination; John Wiley & Sons: Hoboken, NJ, USA, 2019; Available online: https://books.google.com/books?hl=es&lr=&id=GnOODwAAQBAJ&oi=fnd&pg=PR12&dq=Forensic+Accounting+and+Fraud+Examination,+&ots=PMN4s72CCa&sig=9obhV0dZZK1s4MkAvFO_a6fVdl4 (accessed on 14 April 2025).
- SIS. Boletín Estadístico 2024 del Seguro Integral de Salud (SIS). Available online: https://cdn.www.gob.pe/uploads/document/file/7530499/6401616-boletin-estadistico-2024.pdf?v=1737643909 (accessed on 21 June 2025).
- MEF. Consulta Amigable del Ministerio de Economía y Finanzas del Perú; MEF: Lima, Peru, 2023; Available online: https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=ActProy (accessed on 14 May 2025).
- Espinoza Rivera, S. Estrategias Implementadas por el Seguro Integral de Salud y su Influencia en las Transferencias Financieras y su Ejecución por Parte de los Hospitales Nacionales e Institutos Especializados, Lima–Peru, 2009–2017. Master’s Thesis, Universidad Nacional Federico Villarreal, Lima, Peru, 2019. Available online: https://repositorio.unfv.edu.pe/handle/20.500.13084/3121 (accessed on 14 February 2025).
- Espinal Redondez, L.Á.; Ibáñez Alvarado, C.M.; Moyano Melo, M.A.J.A. Propuesta de un Modelo Predictivo para Realizar un Control y Supervisión más Eficiente de las Prestaciones de Servicios de Salud en una Aseguradora Pública de Salud. Master’s Thesis, Universidad Peruana de Ciencias Aplicadas (UPC), Lima, Peru, 2020. Available online: https://repositorioacademico.upc.edu.pe/handle/10757/652194 (accessed on 14 February 2025).
- Galagarza Ruíz, G.I. Validación Prestacional Oportuna de las Prestaciones del Servicio de Cuidados Intensivos de un Hospital nivel III-I Periodo 2012–2014. Master’s Thesis, Universidad de San Martín de Porres, Lima, Peru, 2015.
- Quispe Mamani, J.C.; Quilca Soto, Y.; Calcina Álvarez, D.A.; Yapuchura Saico, C.R.; Ulloa Gallardo, N.J.; Aguilar Pinto, S.L.; Quispe Quispe, B.; Quispe Maquera, N.B.; Cutipa Quilca, B.E. Moral Risk in the Behavior of Doctors of the Comprehensive Health Insurance in the Province of San Román, Puno-Peru, 2021. Front. Public Health 2022, 9, 799708.
- Encuesta Perú 21—Ipsos Inseguridad Ciudadana en Perú. Available online: https://www.ipsos.com/es-pe/inseguridad-ciudadana-en-peru-encuesta-peru-21-ipsos-febrero-2025 (accessed on 8 May 2025).
- INEI. Perú Delincuencia y Corrupción son los Principales Problemas que Afectan al País. Available online: https://m.inei.gob.pe/prensa/noticias/delincuencia-y-corrupcion-son-los-principales-problemas-que-afectan-al-pais-9294/ (accessed on 8 May 2025).
- Algalobo Távara, B.P.; Espinoza Sánchez, N.A. La corrupción y su relación con los índices de pobreza extrema en el Perú. Rev. InveCom 2025, 5. Available online: https://ve.scielo.org/scielo.php?script=sci_arttext&pid=S2739-00632025000102086 (accessed on 8 May 2025).
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Chen, X. Questionable University-sponsored supplements in high-impact journals. Scientometrics 2015, 105, 1985–1995.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.




| Author | Country | Database (Size) | Algorithm | Balancing Method | Accuracy | Precision | AUC | Recall | Specificity | F1 Score | DM Methodology |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [13] | Chile|Banmédica SA, claims (500,000), abusive (169). | ANN–MLP | NR | NR | NR | 82.0% | 73.4% | NR | 91.0% | Ad hoc | |
| [14] | Taiwan | NHI (1,050,979 patients and 17,668 providers) | LR | NR | 84.6% | NR | Ad hoc | |||||
| ANN | 91.5% | ||||||||||
| DT | 98.7% | ||||||||||
| [15] | US|Medicaid US Wisconsin (Medication for 190 beneficiaries) | DT-CHAID | NR | 95.3% | 91.9% | NR | 87.2% | 96.5% | NR | Ad hoc | |
| [16] | US | Medicare and Medicaid (182,809 invoices) | SVM | NR | 99.0% | NR | NR | NR | NR | NR | Ad hoc | |
| [6] | Republic of Korea|HIRA (3075 internal medicine providers) | DT-CDA Score | NR | Ad hoc | |||||||
| [17] | US|Medicare Part A, B, D (759,267 nonfraud, 473 fraud) | LR | SMOTE | NR | 82.8% | NR | Ad hoc | ||||
| RF | RUS | 82.8% | |||||||||
| XGB | RUS | 81.7% | |||||||||
| [18] | US|Nine datasets from Medicare part B, part D, DMEPOS. | XGB | NR | 95.4% | 84.9% | 95.7% | NR | CRISP-DM | |||
| RF | 87.2% | NR | |||||||||
| XGB | 95.9% | 86.2% | 96.8% | ||||||||
| RF | 80.2% | NR | |||||||||
| XGB | 95.0% | 85.5% | 96.9% | ||||||||
| RF | 83.8% | NR | |||||||||
| [19] | US | Medicare Part D (5,344,106 instances, 0.07% fraud) | CATB | NR | 95.7% | NR | Ad hoc | ||||
| XGB | 95.6% | ||||||||||
| RF | 86.4% | ||||||||||
| LGBM | 84.8% | ||||||||||
| ET | 84.1% | ||||||||||
| LR | 83.5% | ||||||||||
| Medicare Part B (8,669,497 instances, 0.05% fraud) | CATB | NR | 95.9% | NR | |||||||
| XGB | 94.7% | ||||||||||
| RF | 84.6% | ||||||||||
| LGBM | 85.0% | ||||||||||
| ET | 86.1% | ||||||||||
| LR | 91.4% | ||||||||||
| [20] | Saudi Arabia|Three providers from 2022 | RF | SMOTE | 98.2% | 98.1% | 90.0% | 100.0% | 80.0% | 99.0% | Ad hoc | |
| LR | SMOTE | 80.4% | 97.6% | 80.2% | 80.4% | 80.0% | 88.2% | ||||
| ANN | SMOTE | 94.6% | 98.0% | 88.0% | 96.1% | 80.0% | 97.0% | ||||
| [21] | US|Exclusive dataset (558,000). | RF | NR | 92.4% | 95.6% | 90.7% | 83.8% | NR | 89.3% | Ad hoc | |
| XGB | 91.7% | 97.4% | 96.2% | 80.4% | 88.1% | ||||||
| SVM | 81.8% | 81.6% | 87.6% | 67.1% | 73.7% | ||||||
| IF | 62.6% | 51.5% | 40.6% | 21.9% | 30.8% | ||||||
| DLM | 91.7% | 94.8% | 90.9% | 65.1% | 77.2% | ||||||
| SEM | 92.8% | 93.6% | 97.0% | 86.9% | 90.2% | ||||||
| [22] | US|Medicare US Part B (9,449,361) | LR | SMOTE-ENN and cross-validation | 65.0% | 69.0% | 73.0% | 67.0% | NR | 65.0% | Ad hoc | |
| DT | 100.0% | 99.0% | 95.0% | 100.0% | 100.0% | ||||||
| RF | 95.0% | 95.0% | 99.0% | 95.0% | 95.0% | ||||||
| XGB | 96.0% | 96.0% | 99.0% | 96.0% | 96.0% | ||||||
| AdaGB | 65.0% | 70.0% | 68.0% | 67.0% | 64.0% | ||||||
| LGBM | 91.0% | 90.0% | 97.0% | 91.0% | 91.0% | ||||||
| [23] | China|Tianchi Precision Social Security Competition: Real Health Insurance Data (NR) | LR | D.Samp. | NR | 95.6% | NR | 73.0% | NR | 82.8% | Ad hoc | |
| XGB | D.Samp. | 81.1% | 86.9% | 83.9% | |||||||
| LR | SMOTE | 95.8% | 78.3% | 86.2% | |||||||
| XGB | SMOTE | 91.2% | 90.3% | 90.8% | |||||||
| LR | BSMOTE | 95.6% | 80.5% | 87.4% | |||||||
| XGB | BSMOTE | 92.6% | 86.9% | 89.6% | |||||||
| LR | GAN | 93.8% | 72.8% | 81.9% | |||||||
| XGB | GAN | 94.4% | 96.4% | 95.4% | |||||||

| Phase | Description |
|---|---|
| Phase 1. Identifying signs of fraud | First, we consider the theoretical framework. According to the definition of HIF [10], fraud cannot be asserted without verifying intent and the obtaining of illegal benefits: “HIF is an act based on intentional deception or misrepresentation to obtain illegal benefits related to the coverage provided by health insurance.” Since intent cannot be determined at this stage, the analysis focuses on noncompliance with the regulatory framework and local regulations, which can generate suspicion of potential fraud. Second, we consider the local socioeconomic dynamics specific to the health sector and their direct influence on HIF. To do this, we analyze the elements defined in fraud theory, using the Fraud Triangle [24], the Fraud Diamond [25], MICE (Money, Ideology, Coercion, and Ego) [26], and COSO (ERM), among others †. |
| Phase 2. Identifying available manifestations and factors | The manifestations of HIF are associated with the types of actors involved ‡, following the study by Villegas-Ortega et al. (2021) [10] and the legal framework. This delimitation supports a more precise and operational analysis, focused on evidence rather than on empirical or abstract categories, and identifies concrete patterns according to the actor’s profile: insurance companies (2 possible manifestations), policyholders (7), or medical providers (12). Similarly, the 47 HIF factors § [10] available in transactional (claims, invoices) and/or nontransactional (regulatory reports, surveys) data are identified. This phase ensures coverage of the factors (variables) associated with HIF in the case study. |
| Phase 3. Data preprocessing and balancing | Data are prepared for analysis through cleaning (duplicate removal, error correction, and handling of missing values), transformation (categorical variable encoding and format standardization), and normalization (data scaling to ensure comparability). Additionally, transactions are labeled as fraudulent or legitimate, and the compatibility of clinical coding standards (ICD-10, CPT) is validated across heterogeneous systems. Because HIF datasets are often unbalanced, with a small proportion of fraudulent cases compared to legitimate ones, traditional techniques such as random oversampling or random undersampling are avoided due to their risk of overfitting or loss of information, respectively. Instead, advanced methods such as SMOTE or others are applied, combined with techniques for handling high dimensionality, to balance the classes and improve the model’s predictive capacity. |
| Phase 4. Model development, training, and evaluation | In this phase, the model is trained using SML, with a stratified division of the data prepared in Phase 3 into training, validation, and test sets. The process includes cross-validation to ensure robustness and hyperparameter tuning, optimizing metrics suited to unbalanced problems such as AUC-ROC (to assess discriminatory capacity between classes), recall (maximizing the detection of true positives), and F1 score (balancing precision and recall). The final evaluation is performed on the test set, so that the reported metrics reflect performance on unseen data and validate the model’s effectiveness in identifying fraud patterns under real-world operating conditions. |
| Phase 5. Deployment and Operational Integration | The model is deployed in a production environment using a scalable architecture that supports both real-time and batch processing and integrates with existing systems. To maximize its operational utility, automated workflows route transactions classified as suspicious to the specialized investigation team (auditors, tax officials, and HIF specialists) to confirm fraud, prioritizing cases with a higher probability of fraud. In addition, security protocols (data encryption, access control) are implemented, and best practices for fraud detection are established. |
| Phase 6. Adaptive Monitoring and Evolution | A continuous monitoring system tracks the performance of the HIF detection model, with automated alerts on key metrics (F1 score, recall, false positive rate) so that the model remains current in identifying new fraud patterns. To this end, the model is retrained periodically, incorporating new factors or manifestations based on regulatory changes, as well as false positive/negative reports from the investigation team. Together, these components allow the system to evolve in tandem with fraudulent tactics and regulatory frameworks, preserving its accuracy in dynamic healthcare environments. |
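
The stratified division mentioned in Phase 4 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `stratified_split`, the 70/15/15 proportions, and the toy labels are assumptions chosen for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that preserve class proportions.

    The 70/15/15 fractions are an assumption; the source does not state them.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(fractions[0] * len(indices))
        n_val = int(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical toy labels with a 1.5% suspicious rate (1 = suspicious).
labels = [1] * 15 + [0] * 985
train, val, test = stratified_split(labels)
```

Splitting each class separately means that every partition keeps at least a few suspicious records, which a purely random split cannot guarantee when the minority class is this small.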

(a)

| Manifestation | Justification | Result |
|---|---|---|
| M10 | Billing for services under diagnostic and procedure codes that are more expensive than those actually performed. The PSA detects this through automated validation rules, and medical appropriateness is determined in the SME; both are linked to the FUA. | ELIGIBLE |
| M11 | The SME identifies cases by detecting FUAs without support in clinical or surgical reports, medical signatures, or laboratory results; in such cases, the FUA is marked as a charge for procedures never performed. | ELIGIBLE |
| M12 | The PCPP assesses the consistency between the diagnosis and the reported treatment. Irregularities are marked in the FUAs as observations, which may indicate fraud, making this manifestation a candidate for the SML model. | ELIGIBLE |
| M13 | None of the SIS audit processes identifies suppliers with possible bribes or illegal commissions, and no such cases are marked in the FUAs because illegitimate payments cannot be identified, so this manifestation is not selected. | INELIGIBLE |
| M14 | Through automatic consistency rules, the PSA blocks these cases by detecting repeated charges for separate procedures. The SME complements this with the analysis of specialized clinical cases, so there are records marked with such observations, which justifies including this manifestation. | ELIGIBLE |
| M15 | The PCPP verifies inconsistencies in dates, seals, or protocols by comparing clinical records with validated standards; however, it does not record the falsification of certificates or medical records, or the alteration of documents to justify payments. | INELIGIBLE |
| M16 | The SME detects anomalies such as multiple diagnostic support procedures in short periods, cross-referencing data with medical records to validate their justification and marking the FUAs with unjustified services or overutilization. | ELIGIBLE |
| M17 | The PSA, with automated validation, identifies atypical claims (e.g., oxygen registered in liters instead of cubic meters) by comparing them with historical profiles of patients and providers; however, the FUAs alone cannot establish that a case is opportunistic fraud. | INELIGIBLE |
| M18 | The PSA applies consistency rules, blocking duplicate FUAs for the same patient during overlapping periods, thus validating the uniqueness of procedures. However, no FUA records are flagged. | INELIGIBLE |
| M19 | None of the SIS audit processes analyzes unnecessary admissions, readmissions, or repeated hospitalizations without clinical improvement. | INELIGIBLE |
| M20 | The SIS does not monitor charges for unused room types or billing for stays in unassigned rooms. | INELIGIBLE |
| M21 | The SIS audit processes do not cover canceled services with discounts. | INELIGIBLE |

(b)

| Category | Factor of HIF | Available Variables | Type | Selected Variables |
|---|---|---|---|---|
| Key identifiers | | V01. Anonymized FUA number | Alphanumeric | Yes |
| | | V02. FUA identifier | Numeric | No |
| | | V03. Year of production | Numeric | No |
| | | V04. Month of production | Alphanumeric | No |
| Demographic data | F22. Gender | V05. Patient’s sex | Numeric | Yes |
| | F23. Age | V06. Patient’s age | Numeric | Yes |
| Supplier details | | V07. SIS Macro-Regional Management | Numeric | No |
| | | V08. UDR | Numeric | No |
| | | V09. Department | Numeric | No |
| | | V10. Implementing Unit | Numeric | No |
| | F7. Supplier details | V11. Supplier code | Alphanumeric | Yes |
| | | V12. Supplier category | Alphanumeric | No |
| Service details | | V13. Service provision | Alphanumeric | Yes |
| | | V14. Destination at discharge | Numeric | Yes |
| | | V15. Date of service | Date | No |
| | | V16. Date of hospital admission | Date | No |
| | | V17. Date of hospital discharge | Date | No |
| | F31. Specialty | V18. Type of care | Numeric | Yes² |
| | | V19. Hospital stay | Numeric | Yes² |
| | F29. Diagnosis; F33. Chronic health condition | V20. ICD-10 primary diagnosis | Alphanumeric | No |
| | | V21. Diagnostic profile: ICD-10 | Numeric | Yes² |
| | | V22. Number of diagnoses | Numeric | Yes² |
| | | V23. Healthcare personnel | Alphanumeric | Yes |
| Economic variables | F18. Refund processes and billing features | V24. Gross valuation of the FUA | Numeric | No |
| | | V25. Gross value of procedures | Numeric | Yes |
| | | V26. Gross value of medicines | Numeric | Yes |
| | | V27. Gross value of inputs | Numeric | Yes |
| Consumption | F30. Medical and surgical treatments | V28. Procedure consumption profile | Numeric | Yes² |
| | | V29. Number of FUA procedures | Numeric | Yes² |
| | | V30. Input consumption profile | Numeric | Yes² |
| | | V31. Number of FUA inputs | Numeric | Yes² |
| | F32. Medicine | V32. Medication consumption profile | Numeric | Yes² |
| | | V33. Number of FUA medications | Numeric | Yes² |
| | | V34. Type of consumption (FUA details) | Alphanumeric | No |
| | | V35. CPMS/SISMED code (FUA details) | Alphanumeric | No |
| | | V36. Amount delivered (FUA details) | Numeric | No |
| | F13. Audit, supervision, sanction, and control | V37. Marking for PCPP in SME | Numeric | Yes |
| | | V38. Number of FUA rules observed in PSA | Numeric | Yes |
| | | V39. PSA result | Numeric | No |
| | | V40. SME result | Numeric | No |
| | | V41. PCPP result | Numeric | No |
| | | V42. Suspicion marking | Numeric | Yes² |
| | | | | Yes = 23 |

DATASET: RLIMA-CE

| Observations and Variables | Original Dataset | Scenario 1: Preprocessed, Unbalanced | Scenario 2: Preprocessed, Balanced |
|---|---|---|---|
| Variables | 42 | 23 | 23 |
| Records | 8,453,846 (100.0%) | 8,453,846 (100.0%) | 16,648,713 (100.0%) |
| Records with suspicions | 129,490 (1.5%) | 129,490 (1.5%) | 8,324,356 (50.0%) |
| Legitimate records | 8,324,356 (98.5%) | 8,324,356 (98.5%) | 8,324,357 (50.0%) |
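
The jump from 129,490 to 8,324,356 suspicious records implies that roughly 8.19 million minority records were generated during balancing. The sketch below shows that arithmetic together with a SMOTE-style interpolation step, assuming SMOTE-like oversampling was the balancing method applied. The function `smote_sketch`, the neighbour count `k`, and the toy points are hypothetical; a production pipeline would normally rely on a tested library instead.

```python
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Counts taken from the dataset table above.
suspicious_original = 129_490
suspicious_balanced = 8_324_356
n_needed = suspicious_balanced - suspicious_original  # synthetic records required

# Toy minority class with two features (hypothetical values).
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_synthetic = smote_sketch(toy_minority, n_synthetic=6, k=2)
```

Each synthetic point lies on a segment between two real minority points, so it stays inside the region the minority class already occupies instead of duplicating existing records.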

| RF Hyperparameters | XGB Hyperparameters | MLP Hyperparameters |
|---|---|---|
| Max_depth = None | Max_depth = 7 | Dropout = 0.2; Units = 64; Learning_rate = 0.001 |
| Estimators = 200 | Estimators = 300 | |
| Min_samples_split = 2 | Learning_rate = 0.1 | |
| Min_samples_leaf = 1 | Subsample = 1.0 | |
| Max_features = ‘sqrt’ | Colsample_bytree = 1.0 | |
| Max_depth = None | Max_depth = 7 | Dropout = 0.2; Units = 64; Learning_rate = 0.001 |
| Estimators = 200 | Estimators = 300 | |
| Min_samples_split = 2 | Learning_rate = 0.1 | |
| Min_samples_leaf = 1 | Subsample = 0.8 | |
| Max_features = ‘sqrt’ | Colsample_bytree = 0.8 | |

S1 = Scenario 1 (unbalanced); S2 = Scenario 2 (balanced). All metrics are computed on the test set.

| Metric | S1 RF | S1 XGB | S1 MLP | S2 RF | S2 XGB | S2 MLP |
|---|---|---|---|---|---|---|
| Precision | 0.829 | 0.839 | 0.714 | 0.995 | 0.973 | 0.926 |
| Recall | 0.380 | 0.340 | 0.239 | 0.994 | 0.984 | 0.965 |
| F1 | 0.521 | 0.484 | 0.358 | 0.994 | 0.978 | 0.945 |
| Accuracy | 0.989 | 0.989 | 0.987 | 0.994 | 0.978 | 0.944 |
| MCC | 0.5569 | 0.5300 | 0.4079 | 0.9889 | 0.9562 | 0.8888 |
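
The metrics in this table all derive from the confusion matrix. The sketch below computes them from raw counts and then recomputes F1 from the reported precision and recall as a consistency check. The function names and the toy confusion-matrix counts are hypothetical; only the precision, recall, and F1 values in `reported` come from the table.

```python
import math

def metrics_from_confusion(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for an unbalanced test set: accuracy looks excellent
# even though most suspicious claims (62 of 100) are missed.
toy = metrics_from_confusion(tp=38, fp=8, fn=62, tn=9892)

# (precision, recall, F1) as reported in the table, keyed by scenario and model.
reported = {
    ("S1", "RF"): (0.829, 0.380, 0.521),
    ("S1", "XGB"): (0.839, 0.340, 0.484),
    ("S1", "MLP"): (0.714, 0.239, 0.358),
    ("S2", "RF"): (0.995, 0.994, 0.994),
    ("S2", "XGB"): (0.973, 0.984, 0.978),
    ("S2", "MLP"): (0.926, 0.965, 0.945),
}
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in reported.values())
```

The recomputed F1 values differ from the reported ones by less than 0.001, which is consistent with rounding. The toy example also shows why accuracy alone is uninformative in Scenario 1: with about 1% positives, a model can exceed 0.99 accuracy while recall stays below 0.40.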
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Villegas-Ortega, J.; Quiroz Aviles, L.N.; Arancibia, J.N.; Montenegro, W.C.; Delgadillo, R.; Mauricio, D. Methodology for Detecting Suspicious Claims in Health Insurance Using Supervised Machine Learning. Future Internet 2025, 17, 584. https://doi.org/10.3390/fi17120584

