Future Internet Applications in Healthcare: Big Data-Driven Fraud Detection with Machine Learning

Fourkiotis, Konstantinos P.; Tsadiras, Athanasios

doi:10.3390/fi17100460

Open AccessArticle

Future Internet Applications in Healthcare: Big Data-Driven Fraud Detection with Machine Learning

by

Konstantinos P. Fourkiotis

and

Athanasios Tsadiras

^*

School of Economics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

^*

Author to whom correspondence should be addressed.

Future Internet 2025, 17(10), 460; https://doi.org/10.3390/fi17100460

Submission received: 9 September 2025 / Revised: 6 October 2025 / Accepted: 7 October 2025 / Published: 8 October 2025

(This article belongs to the Special Issue Information and Future Internet Security, Trust and Privacy—4th Edition)

Download

Browse Figures

Versions Notes

Abstract

Hospital fraud detection has often relied on periodic audits that miss evolving, internet-mediated patterns in electronic claims. An artificial intelligence and machine learning pipeline is being developed that is leakage-safe, imbalance aware, and aligned with operational capacity for large healthcare datasets. The preprocessing stack integrates four tables, engineers 13 features, applies imputation, categorical encoding, Power transformation, Boruta selection, and denoising autoencoder representations, with class balancing via SMOTE-ENN evaluated inside cross-validation folds. Eight algorithms are compared under a fraud-oriented composite productivity index that weighs recall, precision, MCC, F1, ROC-AUC, and G-Mean, with per-fold threshold calibration and explicit reporting of Type I and Type II errors. Multilayer perceptron attains the highest composite index, while CatBoost offers the strongest control of false positives with high accuracy. SMOTE-ENN provides limited gains once representations regularize class geometry. The calibrated scores support prepayment triage, postpayment audit, and provider-level profiling, linking alert volume to expected recovery and protecting investigator workload. Situated in the Future Internet context, this work targets internet-mediated claim flows and web-accessible provider registries. Governance procedures for drift monitoring, fairness assessment, and change control complete an internet-ready deployment path. The results indicate that disciplined preprocessing and evaluation, more than classifier choice alone, translate AI improvements into measurable economic value and sustainable fraud prevention in digital health ecosystems.

Keywords:

artificial intelligence; machine learning; healthcare fraud detection; hospital claims; imbalanced data; threshold calibration; model governance

1. Introduction

In recent decades, advancements in medical treatments, hospital processes, and pharmaceuticals have significantly improved global life expectancy. Simultaneously, artificial intelligence (AI), particularly machine learning (ML), has emerged as a critical tool for optimizing healthcare processes, including fraud detection. Traditional methods, like audits, struggle with the growing complexity of healthcare data, where fraud accounts for approximately 5% of hospital expenditures.

In modern health systems, claims are increasingly submitted, validated, and audited through internet portals and web APIs. Fraud patterns therefore emerge and evolve in online environments at web scale. Our study explicitly frames fraud detection as an internet-native analytics problem over online databases and interoperable health data services.

5G network slicing enables reliable, low-latency data exchange, while blockchain traceability ensures tamper-evident audit trails across claims. IoT devices generate real-time signals preprocessed at the edge to extract fraud-relevant patterns before secure integration into central analytics. These enablers jointly enhance timeliness, integrity, and privacy, forming a scalable, transparent framework for AI-driven healthcare fraud detection.

AI encompasses computational systems designed to emulate human cognitive processes such as learning, reasoning, and decision-making. Within the healthcare domain, AI has been increasingly utilized in a wide spectrum of applications, including diagnostic support, outcome prediction, personalized treatment planning, and fraud detection. Recent methodological advances, particularly the integration of Bayesian inference with deep learning architectures, have significantly enhanced the predictive performance of time-series models in clinical contexts, offering more reliable decision support in dynamic healthcare environments [1]. In parallel, AI has demonstrated substantial impact in pharmaceutical research, contributing to the identification of novel therapeutic targets, optimization of drug candidates, and mitigation of antimicrobial resistance through advanced pattern recognition and data driven modeling approaches [2]. Building on these advances, this study links ML pipelines with next-generation infrastructure to ensure provenance, privacy, and low-latency operation for deployable fraud analytics in hospitals.

AI advances healthcare fraud detection by analyzing large datasets, enabling real-time monitoring and early anomaly detection, surpassing traditional auditing in efficiency and accuracy [3]. Machine learning models such as Random Forest, XGBoost, CNNs, and RNNs show strong precision and adaptability, effectively capturing complex and evolving fraud patterns [4]. The operational value of these models depends on the digital infrastructure, including connectivity, secure interoperability, and distributed learning, which enables prioritization of high-risk claims within audit timeframes.

This approach uses an open-source dataset with demographics, admissions, diagnoses, and financial transactions. Feature engineering enriches the data by linking patient, provider, and claim details, enabling the detection of fraudulent patterns across hospital operations. Aligned with this internet-scale perspective, the proposed pipeline is audit-ready, preserves data provenance, maintains privacy through fold-restricted training, and supports low-latency edge-oriented data ingestion.

The exploration of healthcare fraud detection is structured into seven sections, starting with an Introduction in Section 1 that highlights how ML enhances healthcare efficiency and integrity, particularly in fraud detection for sustainable operations. The article is organized into seven sections. Section 1 introduces the problem and the role of machine learning in sustainable hospital operations. Section 2 reviews related work, while Section 3 outlines the dataset and the leakage-safe preprocessing pipeline. Section 4 details the modeling framework and evaluates eight algorithms, followed by Section 5 on the evaluation protocol, including the composite CPI and SMOTE-ENN comparisons. Section 6 discusses operational implications such as threshold policy and error trade-offs, and Section 7 concludes with findings, limitations, and directions for future research.

The contributions of this study can be summarized as follows:

A systematic comparison of eight algorithms from four methodological groups, evaluated with F1, ROC-AUC, Precision/Recall, MCC, and a composite productivity index tailored to audit workloads.
A robust preprocessing framework that integrates four data sources, engineers 13 additional features, applies imputation, categorical encoding, Power transformation, Boruta feature selection, denoising autoencoders, and SMOTE-ENN within cross-validation folds to prevent leakage.
An operational layer that connects calibrated thresholds and Type I/II errors with alert volumes and recovery expectations, supporting prepayment triage, postpayment audit, and provider-level profiling.
A deployment-oriented framework for fraud analytics that conceptually integrates 5G slicing, blockchain traceability, federated learning, and edge feature extraction to enable timely, privacy-preserving, and auditable detection.

2. Literature Review

Healthcare fraud imposes heavy costs on stakeholders and is enabled by complex revenue systems, collusion, and complex billing processes. Coding schemes like CPT with modifiers add further difficulty in detection. Effective auditing requires tailored strategies such as cross-checking billing with medical records, coding analysis, and anonymous tip lines. Building on prior work on AI-driven fraud detection frameworks [5], recent work frames anti-fraud analytics as an online service that depends on modern networked infrastructures, where connectivity, distributed compute, and verifiable data flows shape what can be detected in practice [6].

The National Health Care Anti-Fraud Association (NHCAA) estimates that fraud costs the United States approximately $68 billion annually, with other estimates reaching as high as $230 billion, representing up to 10% of total healthcare expenditure [7]. These losses could potentially be sufficient to provide coverage for the uninsured population in the United States, highlighting the severe impact of healthcare fraud [8].

The Department of Justice identifies five categories of hospital fraud: billing, illegal referrals, vendor bribes, financial schemes, and false invoices. Fraudulent billing is most frequent, but illegal referrals cause the highest losses, requiring auditors with specialized healthcare knowledge and collaboration with medical experts [9].

The study by Pan et al. [10] highlights differences in fraud roles between administrative and medical professionals, which are important for developing targeted fraud prevention strategies. Button and Gee estimate that global healthcare loses at least 3% and potentially over 5% of its spending to fraud. They highlight that between 1999 and 2006, the UK’s National Health Service (NHS) managed to reduce fraud losses by up to 60%, underscoring the critical role of accurate data and effective detection systems [11]. US healthcare insurance faces fraud losses of $125–175 billion annually within total losses of $600–850 billion [12]. Detection systems use claims, practitioner, and clinical data with supervised and unsupervised methods, adapting to changing fraud patterns [13]. A growing line of research examines how detection throughput is constrained by audit capacity and data latency, which makes threshold policy and real time ingestion central design choices [14].

Sparrow identified two distinct healthcare fraud strategies. The first one is called hit-and-run, where fraudsters rapidly submit multiple claims in a short period to maximize gains before detection, and the second, called steal a little, all the time, where small, ongoing fraudulent claims are made continuously over a longer period [15].

The high-dimensional nature of healthcare data complicates fraud detection. Techniques like spectral clustering and propositionalizing help reduce data complexity [16]. Most systems lack real time detection and rely heavily on manual oversight, but improvements like SVM and knowledge engineering have shown promising results [17]. Recent studies investigate online pipelines with streaming features, edge preprocessing close to data sources, and near real time anomaly scoring, which can shorten detection delays and reduce the window of undetected fraud [18].

Beyond model-level accuracy, healthcare fraud exhibits domain-specific risk factors that shape both feature design and evaluation. Diagnosis Related Group payment schemes are vulnerable to systematic upcoding and code creep, which increase case severity on paper and inflate reimbursements [19]. Prescription behavior can drift over time at the prescriber or clinic level, producing gradual yet anomalous shifts in drug classes, dosages, or refill cadence that are detectable with time series and anomaly-based methods [20]. Fraud is also organized through collusive provider–patient–intermediary networks where repeated co-billing, atypical referral paths, and shared attributes produce collective anomalies [21]. Graph based approaches model these multi-entity relations explicitly, and recent heterogeneous GNNs on claims graphs have shown that community structure and cross-institutional link patterns help surface collusive behavior that tabular models miss [22]. This study addresses the complementary challenge of noisy and imbalanced tabular claims by combining SMOTE ENN with a denoising autoencoder, while recognizing that graph modeling is a natural extension for detecting collusion at network scale.

In his study, Yang mentions that common fraud types include phantom claims, duplicate claims, and upcoding. Effective detection and prevention require advanced techniques such as Markov blanket filtering for feature selection, real time systems, and simplified data inputs for enhancing fraud prevention [23].

Recent research from Abdallah et al. to overcome limitations has focused on automating fraud detection techniques such as dimensionality reduction and pattern recognition to uncover hidden correlations. In this study, it is also mentioned that despite these advancements, most existing systems still heavily rely on human oversight and lack of automation for real-time detection capabilities [24].

Stowell, Schmidt, and Wadlinger show that healthcare fraud continues to threaten the US economy and public welfare despite countermeasures [25]. Wibowo et al. report its harmful effects on service quality and financial stability [26], while NHCAA estimates losses of up to 10% of annual health spending, raising costs and reducing consumer benefits [7].

Dean et al. highlight the root causes and challenges of healthcare fraud in the U.S., stressing the role of Enterprise Risk Management [27]. Aruleba and Sun show that ML can mitigate financial losses [28], while Lavanya et al. note the limits of manual detection and the value of MLPs [29]. Van Capelleveen et al. estimate U.S. fraud at $700 billion annually, affecting Medicare and Medicaid, with outlier detection as a proposed solution [30].

Healthcare fraud detection employs diverse machine learning models, from classical methods like Logistic Regression, KNN, and SVM to ensemble techniques such as Random Forest. Gradient boosting models (XGBoost, LightGBM, CatBoost) and neural networks like MLP also show strong performance in capturing complex patterns. Graph oriented approaches are increasingly used to model collusive structures among providers, patients, and intermediaries, enabling detection of communities and atypical referral pathways that are not visible to record level models [21].

The ACFE highlights AI’s role in detecting healthcare fraud by analyzing large datasets for patterns like over-billing and false claims. While challenges remain in billing code interpretation and data privacy, AI enhances real-time prevention and strengthens healthcare system efficiency and integrity [31].

Peng et al. introduce LR for binary outcomes, emphasizing coefficient and odds ratio interpretation [32], while Sperandei highlights its value in epidemiologic studies [33]. Itoo et al. achieved up to 95.9% accuracy with LR, excelling in multiple metrics [34]. Ahsan et al. compared LDA, RF, and LR for credit card fraud, showing RF best at 60:40 splits, LR at 90:10, and SMOTE effective in improving detection [35].

Zhongheng Zhang highlights the importance of choosing k and distance measures in KNN [36]. Guo et al. propose a model-based KNN using representative points to improve efficiency while maintaining accuracy [37]. Fayaz Itoo et al. compare LR, Naïve Bayes, and KNN for fraud detection, finding LR most accurate [34].

Breiman introduced Random Forest (RF) as an ensemble of decision trees achieving high accuracy, robustness to noise, and scalability to high-dimensional data [38]. Biau later analyzed its statistical properties, showing consistency and adaptability without overfitting [39]. Al-Hashedi and Magalingam further confirmed RF’s effectiveness in financial fraud detection, highlighting its robustness and accuracy [40].

Burges reviewed SVMs, covering linear and non-linear solutions, generalization, and practical implementation [41]. Perols showed SVM and LR outperform other methods in fraud detection under cost and imbalance constraints [42]. Yongli Zhang further highlights SVM’s strong accuracy and robustness, especially with small samples and high-dimensional data [43].

Chen and Guestrin (2016) introduced XGBoost, a scalable boosting system with sparsity-aware optimization for large datasets [44]. Adeola et al. showed its superior accuracy and efficiency in diagnosing kidney disease [45], while Priscilla and Prabha found that hyperparameter tuning improves performance on imbalanced data without resampling [46].

Ke et al. (2017) introduced a highly efficient gradient boosting decision tree framework that utilizes Gradient-based One-Side Sampling and Exclusive Feature Bundling to significantly reduce computational costs while maintaining accuracy, called LightGBM [47]. Du et al. in 2023 propose combining an autoencoder for dimensionality reduction with LightGBM for classification, to enhance credit card fraud detection [48].

Bentéjac et al. (2021) show CatBoost’s effectiveness in chronic kidney disease and fraud detection, noting its robustness with imbalanced data and categorical features [49]. Hancock and Khoshgoftaar confirm CatBoost’s superiority in Medicare fraud detection, outperforming other algorithms on high-cardinality features and achieving the highest AUC [50].

Rosenblatt introduced the Perceptron as the basis for MLPs [51]. Alsmadi et al. (2009) showed backpropagation greatly improves MLP performance in complex classification [52], while Nanduri et al. demonstrated their effectiveness in e-commerce fraud detection by adapting to dynamic fraud patterns [53].

Arık and Pfister introduce TabNet, a deep learning model for tabular data that uses sequential attention to focus on key features, improving interpretability and performance over traditional models [54].

Privacy preserving learning and data provenance emerge as enabling themes in healthcare settings with strict regulation. Federated learning enables training across institutions without raw data exchange, while secure aggregation and differential privacy reduce inference risks, and blockchain based traceability strengthens audit readiness by recording claim lineage and model input provenance [55].

Techniques that enhance ML models by improving data quality and structure address issues like class imbalance and feature scaling, making models more accurate and robust. These methods optimize data preprocessing, enabling better performance and reliable results in diverse applications.

Pellegrini et al. propose a learnable aggregation function (LAF) that approximates standard and complex aggregators, such as variance and skewness. LAF outperforms traditional methods by offering greater flexibility and accuracy for complex datasets and tasks [56].

Scaling and monotonic transformations were applied to stabilize features and reduce skewness. Amorim et al. show that scaler choice, such as Robust Scaler, can strongly affect accuracy under outliers and imbalance [57]. Yeo and Johnson propose a power transformation extending Box–Cox for zero and negative values [58], while the QuantileTransformer maps features to uniform or normal distributions, reducing skewness and outlier impact [59].

To reduce dimensionality while preserving information, three feature-selection methods were applied. Brown et al. use mutual information to capture target dependence while limiting redundancy [60]. Friedman, Hastie, and Tibshirani apply L1 penalties in logistic regression for sparsity and better generalization [61]. Kursa and Rudnicki propose Boruta, a random forest wrapper that retains all strongly relevant features beyond a minimal subset [62].

For compact representations, both linear and non-linear methods were evaluated. PCA captures maximal variance with minimal information loss [63], denoising autoencoders learn noise-resilient features [64], and variational autoencoders provide scalable latent-variable models through efficient stochastic optimization [65].

Given the rarity of positive cases, oversampling and hybrid resampling were applied to balance classes. SMOTE generates synthetic minority samples to expand decision regions [66], ADASYN focuses on harder minority instances to reduce bias [67], and SMOTE-ENN combines oversampling with noise removal for cleaner class separation [68].

Evaluation metrics play a crucial role in assessing the performance of machine learning models, with different metrics being appropriate for different data scenarios.

Powers compares accuracy and F1 score as evaluation metrics for ML models. Accuracy reflects overall correctness in balanced datasets [69], while F1, combining precision and recall, is preferred in imbalanced settings. In fraud detection, false negatives cause major financial losses, and false positives raise administrative costs, making these errors critical in healthcare. Such imbalance necessitates threshold calibration beyond standard accuracy metrics [67].

Fawcett (2006) introduced ROC AUC as a classifier metric, showing its value in visualizing true–false positive trade-offs and providing a single probability-based score, particularly useful for imbalanced datasets [70].

Wasserbacher and Spindler (2022) show that machine learning enhances financial planning and resource allocation through automated analysis [71]. Ali et al. demonstrate its impact on fraud detection by improving resource efficiency [72], while Shen et al. propose consumption- and morbidity-based methods for forecasting drug inventories [73].

Previous studies often used single algorithms or limited comparisons, leaving gaps in model evaluation. This study systematically assesses eight models within a unified preprocessing and evaluation framework, highlighting the value of advanced techniques such as feature engineering, SMOTE-ENN, and dimensionality reduction on high-dimensional data.

3. Dataset and Preprocessing

This section presents the dataset and the preprocessing methodology adopted in this study. Section 3.1, Section 3.2, Section 3.3, Section 3.4, Section 3.5, Section 3.6 and Section 3.7 describe the dataset composition, analytical procedures, and the sequential pipeline of scaling, feature selection, representation learning, and imbalance treatment, implemented under leakage controls to ensure model robustness.

3.1. Dataset Description

This study uses a Kaggle dataset [74] segmented into four files:

Patient demographics and reimbursements (40,474 rows, 30 columns)
Inpatient claims with admissions, discharges, and diagnoses (517,737 rows, 27 columns)
Outpatient claims for non-admitted patients (138,556 rows, 25 columns)
Provider identifiers with a binary fraud label

Table 1 summarizes the dataset features, describing demographic, clinical, and claim-related variables and their roles in building a comprehensive fraud detection framework.

3.2. Data Analysis

This study analyzed data distribution to guide fraud detection. With fraud at 9.35%, imbalance-aware training was applied, restricting resampling and threshold tuning to training folds. Exploratory analysis of financial and utilization variables (Figure 1 and Figure 2) showed skewness, heavy tails, and heterogeneity, reflecting inflationary billing and highlighting discriminative tail behavior. These insights motivated robust preprocessing with outlier handling, and normalization, supported by leverage-resistant models. Evaluation used imbalance-sensitive metrics (PR-AUC, F1, MCC) with calibrated thresholds to align decisions with operational costs. The following Figure 1 presents the distribution of key financial variables by fraud label.

Figure 2 shows claim-duration distributions, outpatients cluster at same-day visits, inpatients average longer stays (≈4 days), both with rare extremes. These patterns suggest using spline terms, duration flags, and interaction with monetary variables, ideally with stratified models. To prevent bias, splits must be leakage-controlled, while imbalance-aware objectives and threshold tuning balance prolonged fraud cases against benign long stays.

These descriptive patterns establish the statistical setting of the problem and highlight the need for a leakage-controlled framework supported by robust preprocessing and duration-sensitive representations.

3.3. Data Integration and Preprocessing

The healthcare dataset comprises four interrelated tables: Inpatient, Outpatient, Beneficiary, and Provider, as illustrated in Figure 3. The Inpatient and Outpatient tables, indexed by Claim ID, include patient, provider, physician, timing, and diagnostic details, and are first merged to consolidate hospital services. This structure is then linked to the Beneficiary table via Beneficiary ID, enriching claims with demographic and clinical attributes such as age, gender, location, chronic conditions, and mortality data. Finally, integration with the Provider table adds the binary Fraud Result attribute, which serves as the ground-truth label for supervised learning. Through this sequential merging process, raw records are transformed into a unified dataset that represents patient, claim, and provider dimensions, aligned with the fraud detection objective and used as input for all subsequent preprocessing and modeling stages.

The complete workflow is summarized in Figure 4, which illustrates the sequential stages from data integration and preprocessing to feature scaling, dimensionality reduction, imbalance handling, model training, evaluation, and final model selection.

Before modeling, we define a systematic preprocessing framework for high-dimensional, heterogeneous, and imbalanced claims data, where naïve steps risk bias and leakage. The pipeline performs controlled comparisons, at least three alternatives per stage, under stratified 5-fold cross-validation, applying all transforms strictly within folds to prevent leakage. Scaled/transformed datasets are exported for transparency and reproducibility. Subsequent subsections detail feature scaling, feature selection, representation learning, and class-imbalance handling, each reported with metrics. All features were aggregated to the provider level prior to modeling, and cross-validation used a stratified GroupKFold by Provider to prevent any provider appearing in both training and test folds, thereby eliminating provider-level leakage.

Although timestamps such as claim and admission dates could support temporal feature engineering, this study aggregated records at the provider level to prevent cross-claim leakage during cross-validation. As a result, explicit sliding-window statistics (e.g., rolling means or exponentially weighted averages) were not generated. Future work may exploit such temporal dynamics to capture gradual shifts in provider or patient behavior for improved fraud detection.

Records were aggregated at the provider level to prevent cross claim leakage during cross validation. As a result, sliding window statistics such as rolling means or exponentially weighted averages were not produced. Future research could incorporate temporal dynamics to capture gradual behavioral shifts in provider or patient activity.

3.4. Scaling

In high-dimensional claims data, feature scale and distribution affect separability, regularization, and threshold stability under imbalance. Normalization reshapes skewness and extreme values, influencing margins and calibration. For rare events like hospital fraud, the focus is not only ROC-AUC but precision–recall trade-offs that determine audit workload and recovery. This study evaluates normalization methods, attributing observed differences to the transformation rather than model capacity.

Three techniques are considered, each encoding a distinct bias about the data generating process. Robust scaling, with median and IQR, downweighs outliers by centering and scaling with robust statistics. Power transformation performs variance stabilization and approximate Gaussianization while preserving order, a combination that can improve linear decision boundaries and probability calibration. Quantile normalization to a normal output enforces a full rank-preserving warp to standard marginals. This can enhance rank-based discrimination such as PR-AUC, yet may distort local metric structure relevant to thresholded decisions. Under severe imbalance, theory thus suggests tension. Transforms that maximize recall by broadening minority mass may simultaneously inflate false positives, whereas variance-stabilizing transforms can improve precision without sacrificing too much sensitivity.

Table 2 summarizes stratified 5-fold cross validated discrimination statistics for the three normalizations under an identical pipeline. Best values per column are highlighted with bold.

Results corroborate the theoretical expectations. Power transformation delivers the most balanced profile, leading F1, ROC-AUC, and Precision, a desirable operating regime for audit-constrained fraud detection, where excess false positives carry tangible cost. Quantile normalization attains the top PR-AUC, indicating strong rank discrimination. Robust scaling amplifies Recall to 0.966, which is consistent with outlier tolerant centering, but the accompanying collapse in Precision (0.155) renders this configuration operationally impractical. Accordingly, subsequent stages in this research advance with Power normalization as the default.

Financial variables showed strong right skew due to infrequent high cost claims. The Yeo Johnson power transformation was applied to stabilize variance and compress extreme values while preserving proportionality among typical observations. This transformation improves separability and calibration in models that are sensitive to outliers and class imbalance.

3.5. Feature Selection

Feature selection addresses variance control under limited positives (≈9.4%) and removal of redundant or weak signals that reduce precision. Three methods are tested. KBest-MI, using mutual information for broad non-linear effects but with redundancy, L1-penalized logistic, enforcing sparsity for compact, interpretable sets but discarding correlated weak variables, and Boruta, a tree-based wrapper that retains all strongly relevant features, including interactions, while filtering noise. Table 3 depicts stratified 5-fold results, with best values in bold.

Table 3 shows that all three feature selection methods perform similarly, with only minor differences. Boruta has a slight edge, giving the highest F1 (0.520) and PR-AUC (0.677) while maintaining strong ROC-AUC and recall. KBest-MI and L1-Logistic yield nearly identical results. Overall, all methods retain informative features, though Boruta provides a small but consistent advantage.

Feature relevance was evaluated using the Boruta algorithm with a Random Forest base estimator of 500 trees and balanced class weighting. The procedure was performed within each cross-validation fold to prevent information leakage and was limited to 100 iterations with a significance level of 0.05. In the absence of longitudinal data, temporal proxies such as claim duration and admission discharge intervals were included. Future studies could add rolling window features and Shapley value analysis to enhance feature stability and interpretability.

3.6. Representation Learning

Representation learning, unlike feature selection, generates embeddings that capture latent structure in high-dimensional claims data. This study evaluates PCA, Denoising Autoencoders, and Variational Autoencoders as complementary methods for mapping claims into lower-dimensional spaces with more regular class boundaries. PCA achieves linear decorrelation and variance compression but lacks the ability to model curvature. Denoising Autoencoders learn stable features by reconstructing noise-corrupted inputs, often aligning with fraud signatures. Variational Autoencoders impose a latent prior for smoother, disentangled factors, though sometimes at the expense of boundary sharpness critical for minority-class detection.

Table 4 reports stratified 5-fold discrimination statistics for the three embeddings learned on the Boruta subset. Latent dimensionality is shown to contextualize compression strength. Table 4 depicts the best values in bold.

Among the methods, Denoising Autoencoders provide the most balanced performance across F1, PR-AUC, ROC-AUC, and Precision, showing that moderate non-linearity with denoising effectively captures minority structure without raising false positives. PCA achieves the highest Recall (0.895) but loses Precision, lowering F1. Variational Autoencoders fall between, with strong ROC-AUC and Recall but weaker Precision. Overall, DAE emerges as the most reliable default for downstream modeling, while PCA and VAE remain useful baselines in settings prioritizing recall or calibration.

3.7. Class Imbalance

With a minority prevalence of ≈9.4% (506 positives, 4904 negatives), fraud detection in claims data is dominated by class imbalance. Beyond probability ranking, the practical objective is to increase positive yield at tolerable false-alert rates, to improve F1 and precision–recall behavior at operating thresholds. On top of the DAE representation, this study investigates three families of resampling such as, SMOTE (synthetic minority oversampling via k-NN interpolation), ADASYN (adaptive oversampling that focuses on hard-to-learn minority instances), and SMOTE-ENN, which augments SMOTE with Edited Nearest Neighbors to remove ambiguous majority points near the decision boundary. These methods introduce different geometric priors. SMOTE smooths minority manifolds. ADASYN aggressively expands into difficult regions. ENN cleans noisy borders to favor precision.

Table 5 reports mean test metrics across folds on the DAE embeddings from the Boruta subset, comparing no resampling to SMOTE, ADASYN, and SMOTE-ENN. Bold values denote the highest performance for each metric.

ADASYN raises recall (0.911) but lowers precision (0.300), reducing F1 and making it costly for audits. SMOTE-ENN provides the most balanced results across metrics, while the small gap with no resampling suggests DAE embeddings already stabilize class geometry. Thus, SMOTE-ENN is chosen for resampled modeling, with the baseline kept as a strong reference for robustness.

4. Modeling Algorithms

Eight supervised learning algorithms were implemented to capture complementary modeling behaviors across linear, instance-based, ensemble, and neural methodologies. Logistic Regression and Support Vector Machines provide interpretable and well-calibrated probabilistic baselines suitable for structured data. Random Forest and k-Nearest Neighbors extend the analysis to non-parametric methods that capture complex local and global relationships without strong distributional assumptions. Gradient boosting frameworks, including LightGBM, XGBoost, and CatBoost, were selected for their ability to model non-linear feature interactions efficiently and to generalize well on high-dimensional, imbalanced datasets. The Multi-Layer Perceptron introduces a compact deep learning component capable of learning latent representations from denoised embeddings while maintaining manageable model complexity. These algorithms represent mature, reproducible, and computationally efficient methods widely adopted for tabular healthcare prediction. More complex architectures designed for sequential, graphical, or image-based data were excluded, as they require substantially larger datasets, introduce additional hyperparameter uncertainty, and provide limited benefits for independent claim-level records. The chosen set therefore reflects an optimal trade-off between predictive accuracy, interpretability, and operational feasibility in healthcare fraud detection.

Logistic Regression (LR)

A convex probabilistic classifier minimizing the Bernoulli negative log-likelihood, with L1/L2 regularization it is a strong, well-calibrated baseline for tabular tasks and provides interpretable coefficients. On compact DAE embeddings, the linear decision boundary often trades a small loss in recall for improved precision and stable calibration. Best values: C = 3, penalty = L2, solver = liblinear, class_weight = balanced, max_iter = 2000.

2.: K-Nearest Neighbor (kNN)

A non-parametric, instance-based rule, where consistency improves with data, but performance is sensitive to the metric and to feature scaling. On DAE, distance weighting with a mid-sized neighborhood improved robustness to local noise near the decision boundary. Best values: n_neighbors = 11, weights = distance, p = 2 (Minkowski/Euclidean), standardization applied inside CV.

3.: Random Forest

Bagging of decorrelated decision trees reduces variance while capturing interactions and non-linearities. It is robust to outliers and provides native feature importance. Best values: n_estimators = 500, max_depth = 16, min_samples_leaf = 2, max_features = log2, class_weight = balanced.

4.: Support Vector Machines (SVM)

Maximum-margin classification with the kernel trick. The Gaussian (RBF) kernel yields flexible non-linear boundaries governed by C (soft margin) and γ (locality of similarity). Best values: C = 10, gamma = 0.01, kernel = rbf, class_weight = balanced, probability = True.

5.: LightGBM (LGBM)

Leaf-wise gradient boosting with histogram splits is notably efficient on high-dimensional tables while retaining competitive accuracy. Best values: n_estimators = 600, num_leaves = 63, max_depth = 10, learning_rate = 0.05, subsample = 0.85, colsample_bytree = 0.80, min_child_samples = 20, class_weight = balanced, verbosity = −1

6.: eXtreme Gradient Boosting Classifier (XGBOOST)

Second-order gradient boosting with shrinkage, row/column subsampling, and explicit regularization. The histogram method sustains speed without sacrificing accuracy on tabular data. Best values: n_estimators = 500, max_depth = 7, learning_rate = 0.05, subsample = 0.85, colsample_bytree = 0.80, gamma = 0.0, eval_metric = logloss, tree_method = hist

7.: Categorical Boosting (CatBoost)

Boosting with ordered target statistics and permutation-driven regularization mitigates prediction shift and often performs strongly with minimal preprocessing. Best values: iterations = 700, depth = 8, learning_rate = 0.05, l2_leaf_reg = 3, loss_function = Logloss, verbose = False

8.: Multi-layer Perceptron Classifier (MLPs)

A feed-forward neural network trained by backpropagation, with ReLU and weight decay that provides a flexible non-linear baseline for tabular fraud signals. Best values: hidden_layer_sizes = (128,64), activation = relu, alpha = 1 × 10⁻⁴, learning_rate_init = 3 × 10⁻⁴, max_iter = 300

The following Table 6 lists the hyperparameter grids per algorithm where bold marks the final value used in this study.

Table 5 reports the parameter grids investigated for each learning algorithm, with the final retained configurations highlighted in bold. The adopted values reflect empirical trade-offs between model complexity, computational feasibility, and stability across folds. For ensemble methods such as Random Forest, XGBoost, LightGBM, and CatBoost, the chosen depths, number of estimators, and sampling ratios favor moderate complexity that controls variance.

In contrast, the RBF SVM relies on carefully selected regularization (C) and kernel width (γ) to produce smooth yet responsive decision surfaces. Linear models (Logistic Regression) were tuned with penalties and solvers that ensured convergence under imbalance, preserving interpretability of coefficients. For k-Nearest Neighbors, distance weighting and scaling were critical to stabilize performance, while the neural network was deliberately configured with compact hidden layers and controlled learning rates to avoid overfitting. Taken together, the bolded settings provide reproducible defaults that balance generalization and computational efficiency for downstream evaluation.

The selected algorithms represent complementary methodological paradigms encompassing linear, ensemble, kernel-based, and neural models. Logistic Regression provides a well-calibrated linear baseline under class imbalance, while tree ensembles (Random Forest, XGBoost, LightGBM, CatBoost) capture high-order feature interactions with intrinsic handling of non-linearity and categorical data. SVM with an RBF kernel models complex boundaries through similarity metrics, and the MLP extends capacity for non-linear abstraction within compact architectures suitable for tabular data. Hyperparameters were tuned within bounded grids to balance discrimination power, interpretability, and computational efficiency, ensuring comparability across models. This design enables both methodological diversity and operational feasibility for healthcare fraud detection at scale.

5. Evaluation

This study reports a strict, leakage-safe modeling on DAE embeddings with per-fold threshold tuning, evaluated by a fraud-oriented composite CPI that prioritizes detection while controlling false alarms.

C P I_{\{f r a u d\}} = 0.30 R e c a l l + 0.20 Precision + 0.20 MCC + 0.15 F 1 + 0.10 R O C - A U C + 0.05 G - M e a n

This weighting emphasizes recall (auditor coverage) and precision (audit yield), with MCC capturing balanced correctness under imbalance, and F1/ROC-AUC/G-Mean providing complementary views of ranking quality and operating balance. This study’s CPI reweighting (Recall 0.30, Precision 0.20, MCC 0.20, F1 0.15, ROC-AUC 0.10, G-Mean 0.05) compresses the top of the leaderboard into a practical tie.

The relative weights were determined based on the operational relevance of each metric to healthcare fraud auditing. Recall received the highest weight, as undetected fraud imposes the greatest expected financial loss, while precision followed to preserve investigator efficiency and maintain optimal audit yield. MCC was weighted equally to ensure balanced assessment under class imbalance and to enhance model reliability across folds. F1 was assigned a moderate weight to capture the harmonic interaction between recall and precision at the calibrated threshold. ROC-AUC contributed a smaller proportion to represent global ranking ability across thresholds, and G-Mean was included as a secondary stability indicator balancing sensitivity and specificity. Collectively, these proportional weights reflect the economic and operational priorities of fraud detection, aligning the composite score with institutional decision-making objectives.

Table 7 presents the performance of eight classifiers under the baseline regime without resampling, evaluated on the DAE embeddings with fraud-oriented CPI as the principal metric. Table 7 below depicts the best values in bold.

Table 7 summarizes the performance of the evaluated models across the composite CPI and standard classification metrics. The Neural Network (MLP) achieves the highest CPI (0.644), highlighting its balanced performance across different evaluation criteria. CatBoost stands out with the best F1 score (0.608), ROC-AUC (0.933), Precision (0.642), Specificity (0.966), Accuracy (0.930), MCC (0.572), and the lowest Type I Error (0.033), demonstrating strong discriminative ability and reliability in minimizing false alarms. On the other hand, KNN exhibits the highest Recall (0.911) and G-Mean (0.810), along with the lowest Type II Error (0.089), indicating superior capability in detecting fraud cases. Overall, the results suggest that while CatBoost provides the most consistent performance across several key metrics, KNN emphasizes detection sensitivity, and MLP offers the best composite performance as reflected in the CPI.

To further assess the impact of resampling, Table 8 reports the corresponding results when the models are trained with the SMOTE-ENN strategy. This comparison highlights how the integration of resampling techniques affects both the composite CPI and the standard evaluation metrics. Table 8 illustrates the best values with bold.

Table 8 presents the performance of the models when trained with the SMOTE-ENN resampling strategy. CatBoost achieves the best composite CPI (0.633), F1 (0.603), Precision (0.651), Specificity (0.969), Accuracy (0.931), and the lowest Type I Error (0.031), confirming its robustness under resampling. The Neural Network (MLP) attains the highest ROC-AUC (0.933), while XGBoost yields the strongest MCC (0.560). SVM (RBF) shows competitive performance with the second-highest CPI (0.633) and strong Recall (0.634). Finally, KNN maintains its strength in detection sensitivity, achieving the best Recall (0.915), G-Mean (0.810), and the lowest Type II Error (0.085). Overall, SMOTE-ENN shifts the balance, while CatBoost continues to dominate across most performance dimensions.

Table 9 provides a direct head-to-head comparison between the baseline (No Sampler) and the resampling-based (SMOTE-ENN).

The results in Table 9 demonstrate that the influence of SMOTE-ENN is heterogeneous across algorithms and evaluation metrics. Neural Networks (MLP), Random Forests, and Logistic Regression exhibit consistently positive differences in CPI, F1-score, Precision, MCC, and Accuracy, which indicates that these models attain superior overall performance in the absence of resampling.

LightGBM, CatBoost, and SVM (RBF) benefit from SMOTE-ENN through fewer Type II errors and higher recall, while KNN and XGBoost show minimal changes. Overall, classical models perform better without resampling, whereas boosting and kernel methods gain from added class balance. Aggregate results confirm higher performance without resampling, with SMOTE-ENN offering modest reductions in Type I/II errors.

MLP, Random Forest, Logistic Regression, LightGBM, and CatBoost reach higher CPI without resampling, while SVM (RBF) and XGBoost benefit from SMOTE-ENN through improved recall and fewer false negatives. KNN is largely unaffected. Thus, no-resampling is generally preferable, with SMOTE-ENN useful when reducing false negatives is the priority.

Beyond identifying the preferred sampling strategy per model, it is also essential to assess how consistently each algorithm performs across multiple metrics. Figure 5 compares algorithms across six metrics. CatBoost shows the most balanced performance, combining high F1, MCC, and ROC-AUC with strong precision and recall. XGBoost, SVM, and Logistic Regression perform well with minor trade-offs, while KNN is unbalanced, favoring recall at the cost of precision. LightGBM and Random Forest remain stable but weaker. Overall, CatBoost stands out as the most consistent algorithm.

Figure 6 presents the CatBoost confusion matrix at threshold τ ≈ 0.30. Of 5410 cases, the model correctly identified 4739 legitimate and 294 fraudulent claims, while misclassifying 165 as false positives and 212 as false negatives. This corresponds to recall 0.581, precision 0.642, specificity 0.966, and F1 score 0.609, illustrating the model’s effectiveness in distinguishing fraudulent from legitimate claims.

Figure 7 shows the CatBoost ROC curve with AUC = 0.930, confirming strong class separability. At the F1-optimal threshold (τ = 0.3), the true positive rate is 0.70 and the false positive rate 0.06, consistent with the confusion matrix. Although ROC-AUC can be optimistic in imbalanced data, the high value supports CatBoost’s robustness, while threshold choice balances false negatives against investigative cost. The ROC curve thus underpins cost-sensitive threshold selection.

To enhance financial interpretability, the confusion matrix at τ = 0.3 was extended to include predictive value metrics. The model achieved a PPV of 0.641 and an NPV of 0.957, meaning fewer than 5% of legitimate claims were wrongly flagged while one in three alerts was a false positive. Assuming a review cost of USD 50,000 per one million claims, this corresponds to an additional USD 18,000 per million in audit expenses. Although overall results are aggregated, future work may incorporate DRG-specific evaluation to capture clinical variation and guide targeted auditing.

To contextualize these findings, the performance of the proposed models was also compared with prior work. Table 10 contrasts the results of this study with those reported by Suri (2019) [75] for Logistic Regression and Random Forest. The article addresses fraudulent insurance claims in healthcare and their financial impact. It evaluates Logistic Regression and Random Forest models to identify fraud patterns, aiming to enhance automated detection systems and reduce economic losses. While both studies achieve comparable levels of accuracy, the present models exhibit markedly higher discriminative ability, as reflected in superior ROC-AUC and F1-scores.

The evaluation is methodologically robust, using stratified cross-validation with strict fold separation, threshold optimization, and controlled preprocessing. At each pipeline stage, alternative techniques were tested and only the best retained, minimizing leakage and ensuring reliable results that provide a stronger benchmark than simpler procedures such as those in Suri’s study, showing the results in Table 10, depicting the superiority of this study, with the best values illustrated in bold.

6. Discussion

In this study a leakage-safe, end-to-end pipeline was engineered to translate Future-Internet-oriented fraud-detection modeling into operational and economic value for hospitals and payers. The key finding is that model accuracy alone is insufficient to deliver impact. What matters is the alignment between statistical performance, threshold policy, and audit capacity. In large-scale audit systems, even highly accurate classifiers can generate excessive alerts that exceed review capacity or, conversely, suppress detection when thresholds are set too conservatively. Meaningful impact arises only when predictive precision and recall are balanced against institutional resources and risk tolerance, linking statistical metrics to actionable workload and recovery outcomes. By integrating threshold calibration, scenario analysis, and governance, the system converts probabilistic outputs into decisions that can be executed via web-facing workflows in prepayment triage and postpayment audit without overwhelming staff or delaying legitimate claims.

Methodologically, the design choices were made to maximize reliability rather than headline scores. All preprocessing, feature learning, and hyperparameter selection were performed within cross-validation folds to prevent information leakage. The use of both resampled and non-resampled training regimes showed that gains in F1 and recall under class-imbalance handling did not come at the expense of inflated optimism, the same evaluation protocol was maintained so that improvements could be attributed to the modeling choices rather than to procedural artifacts. The resulting performance is therefore interpretable for deployment, with calibrated scores that remain stable across folds.

The comparative analysis indicates that, for the overlapping algorithms, this study’s pipeline outperformed prior baselines reported by Suri [75] (2019). More importantly for practice, models such as multilayer perceptron and CatBoost, when paired with the full preprocessing and calibration stack, achieved the strongest results among the candidates evaluated. These gains arose from the interaction between denoising and normalization, feature selection guided by cross-validated importance, and threshold optimization against explicit audit objectives. The pattern is consistent with the view that modern architectures deliver value when embedded in disciplined evaluation and policy tuning rather than used in isolation.

Complementing discrimination metrics, the operating-point analysis using Type I and Type II errors clarifies the practical trade-offs that matter for audit capacity. Reporting Type I and Type II explicitly therefore anchors model selection to institutional risk appetite and makes threshold policy transparent.

Operational considerations were central to the evaluation. In prepayment, the calibrated scores support tiered review, sending only the upper-risk tail to investigators and preserving flow for low-risk claims. In postpayment, adjusted thresholds recover additional losses through targeted retrospective audits without requiring additional headcount. At the provider level, aggregation of claim-level scores into risk indices, controlling volume and case mix, enables proportionate responses that begin with education for borderline behavior and escalate to focused investigation when anomalies persist.

Governance completes the bridge from models to policy. Continuous monitoring for drift, calibration audits across subgroups, and fairness checks ensure that performance remains stable as coding practices and case mix evolve. A structured change-management protocol authorizes threshold updates, retraining, and version promotion, while documentation of features and attributions supports internal review, external audit, and clinician communication without revealing proprietary elements.

Ethical reliability is essential for deploying AI in healthcare. While this study focused on technical and operational aspects, potential bias from uneven demographic or insurance representation is acknowledged. Future research should include subgroup fairness testing using AUC disparity and parity metrics, along with adversarial debiasing and fairness audits to ensure equitable model performance.

The economic implications follow directly from these mechanisms. Higher precision at the review threshold reduces wasted investigator time and mitigates adversarial interactions with compliant providers. Higher recall of the calibrated operating points increases recovered losses. Because thresholds are set in view of capacity, alert volumes remain manageable, which protects day-to-day operations. Taken together, these effects raise audit productivity and reduce fraudulent leakage, the outcomes that finance and compliance units ultimately target. From an internet-operations perspective, the system supports continuous monitoring on streaming data and role-based access over online databases. Furthermore, next-generation infrastructures such as 5G connectivity, blockchain-based auditability, federated learning, and edge analytics can further enhance scalability, trust, and privacy, aligning the proposed framework with the core principles of the Future Internet.

Although the evaluation controlled for leakage and used rigorous cross-validation, external validity depends on institutional context, and local billing practices not represented in the development data.

Overall, the evidence supports the claim that edge techniques deliver practical value when embedded within a disciplined pipeline that links modeling to thresholds, capacity, and governance. The contribution of this study is not only improved accuracy but the articulation of a deployment pathway that makes those improvements actionable in hospital settings. By treating decision thresholds as policy instruments, by validating economic and operational metrics alongside standard measures, and by institutionalizing monitoring and change control, the system demonstrates a credible route from predictive performance to measurable gains in recovery and workload management.

7. Conclusions and Future Work

This study shows that disciplined preprocessing and honest evaluation are as decisive as the nominal classifier for hospital fraud detection. A leakage-safe, imbalance-aware pipeline uses Power transformation, Boruta selection, and denoising autoencoders to align statistical gains with auditor-relevant decisions that save time and enhance recovery. The composite productivity index used here ties model choice and threshold policy to precision, recall, MCC, and ROC-AUC in proportions aligned with workload and recovery priorities, and it supports both prepayment triage and postpayment audit. Within this framework, MLP achieved the highest composite CPI, CatBoost delivered the best control of false positives and strong accuracy, and resampling had limited incremental value once upstream representations regularized class geometry.

The economic perspective is central. Type I and Type II errors are reported at the operating threshold, making explicit the trade-off between investigator burden and missed fraud. Thresholds function as policy instruments that map directly to alert volume and expected yield, and calibration inside cross-validation provides estimates decision makers can trust for prepayment, postpayment, and provider-level profiling.

Labels are binary at the provider level, which constrains typology-specific conclusions and may mask specialty heterogeneity. Claims data embed coding artifacts and policy effects, and choices in feature design and temporal aggregation can shift relative performance. These factors advise caution when generalizing beyond the studied cohort.

Future Internet directions include integrating additional online data sources and federated learning across institutions to respect data sovereignty while updating models from internet-connected sites. Also, future research should strengthen the link between prediction and economics through cost-sensitive training and thresholding under explicit loss matrices, external validation by state, setting, specialty, and time, and the use of semi and self-supervised learning to exploit unlabeled claims. Graph-based models over beneficiaries, providers, and procedures may capture network structure that single-claim models miss.

In summary, a carefully engineered and leakage-safe pipeline surpasses prior baselines on overlapping algorithms and, when paired with MLP and CatBoost, delivers stronger performance at actionable thresholds. With the extensions outlined above, the framework can evolve into a continuously internet-based learning decision service that adapts to emerging fraud patterns while safeguarding operational, legal, and ethical constraints in the healthcare sector.

Author Contributions

Conceptualization, K.P.F. and A.T.; methodology, K.P.F.; software, K.P.F.; validation, K.P.F. and A.T.; formal analysis, K.P.F.; investigation, K.P.F.; resources, A.T.; data curation, K.P.F.; writing—original draft preparation, K.P.F.; writing—review and editing, A.T.; visualization, K.P.F.; supervision, A.T.; project administration, A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All models, or code, that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Irani, H.; Metsis, V. Enhancing Time-Series Prediction with Temporal Context Modeling: A Bayesian and Deep Learning Synergy. In Proceedings of the International Conference on Time Series Analysis, Gran Canaria, Spain, 15–17 July 2024. [Google Scholar]
Ghaderzadeh, M.; Shalchian, A.; Irajian, G.; Sadeghsalehi, H.; Zahedi Bialvaei, A.; Sabet, B. Artificial Intelligence in Drug Discovery and Development Against Antimicrobial Resistance: A Narrative Review. Iran. J. Med. Microbiol. 2024, 18, 135–147. [Google Scholar] [CrossRef]
Makandah, E.A.; Aniebonam, E.E.; Okpeseyi, S.B.A.; Waheed, O.O. AI-Driven Predictive Analytics for Fraud Detection in Healthcare: Developing a Proactive Approach to Identify and Prevent Fraudulent Activities. Int. J. Innov. Sci. Res. Technol. 2025, 10, 1521–1529. [Google Scholar]
Lekkala, L.R. Importance of Machine Learning Models in Healthcare Fraud Detection. Voice Publ. 2023, 9, 207–215. [Google Scholar] [CrossRef]
Fourkiotis, K.P.; Tsadiras, A. Using Boosting and Neural Networks Methods to Detect Healthcare Fraud. In IFIP International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024. IFIP Advances in Information and Communication Technology; Springer: Cham, Switzerland, 2024; Volume 713, pp. 149–162. [Google Scholar] [CrossRef]
Osama, M.; Ateya, A.A.; Sayed, M.S.; Hammad, M.; Pławiak, P.; Abd El-Latif, A.A.; Elsayed, R.A. Internet of Medical Things and Healthcare 4.0: Trends, Requirements, Challenges, and Research Directions. Sensors 2023, 23, 7435. [Google Scholar] [CrossRef]
National Health Care Anti-Fraud Association (NHCAA). The Challenge of Health Care Fraud 2017. Available online: https://iskm.issa.int/node/4681 (accessed on 10 May 2025).
Rosenbaum, S.J.; Lopez, N.; Stifler, S. Health Insurance Fraud: An Overview; 229; Health Policy and Management; Faculty Publications; Department of Health Policy, School of Public Health and Health Services, The George Washington University: Washington, DC, USA, 2009; Available online: https://hsrc.himmelfarb.gwu.edu/sphhs_policy_facpubs/229/ (accessed on 6 October 2025).
Department of Justice (DOJ). Five Doctors and Eight Healthcare Professionals Charged as Part of National Healthcare Fraud Takedown 2018. Available online: https://www.justice.gov/usao-edny/pr/five-doctors-and-eight-healthcare-professionals-charged-part-national-healthcare-fraud (accessed on 20 May 2025).
Pan, K.; Pearce, C.; Jones, S.T.; Lui, Z. Types of Hospital Frauds: Nature and Methods of Prevention. J. Forensic Investig. Account. 2023, 15, 64–80. [Google Scholar]
Gee, J.; Button, M.; Brooks, G. The Financial Cost of Healthcare Fraud—2011 Report; PKF (UK) LLP and University of Portsmouth: Portsmouth, UK, 2011. [Google Scholar]
Kelley, R. Where Can $700 Billion in Waste Be Cut Annually from the U.S. Healthcare System? White Paper; Thomson Reuters: Toronto, ON, Canada, October 2009. [Google Scholar]
Yamanishi, K.; Takeuchi, J.; Williams, G.; Milne, P. On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Min. Knowl. Discov. 2004, 8, 275–300. [Google Scholar] [CrossRef]
Matloob, I.; Khan, S.; Rukaiya, R.; Alfrahi, H.; Ali Khan, J. Healthcare Fraud Detection Using Adaptive Learning and Deep Learning Techniques. Evol. Syst. 2025, 16, 72. [Google Scholar] [CrossRef]
Weiss, L.D. Review of License to Steal: How Fraud Bleeds America’s Health Care System, by M.K. Sparrow. J. Public Health Policy 2001, 22, 361–363. [Google Scholar] [CrossRef]
Zhu, Y.; Li, J.; Wang, H.; He, P.; Liu, X. Semantic Relation Extraction with Multiple Strategies. In Proceedings of the 2013 IEEE International Conference on Healthcare Informatics (ICHI), Philadelphia, PA, USA, 9–11 September 2013; pp. 499–504. [Google Scholar] [CrossRef]
Francis, C.; Pepper, N.; Strong, H. Using Support Vector Machines to Detect Medical Fraud and Abuse. In Proceedings of the 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Boston, MA, USA, 30 August–3 September 2011; pp. 8291–8294. [Google Scholar] [CrossRef]
Islam, U.; Alatawi, M.N.; Alqazzaz, A.; Alamro, S.; Shah, B.; Moreira, F. A Hybrid Fog–Edge Computing Architecture for Real-Time Health Monitoring in IoMT Systems with Optimized Latency and Threat Resilience. Sci. Rep. 2025, 15, 25655. [Google Scholar] [CrossRef]
Crespin, D.; Dworsky, M.; Levin, J.; Ruder, T.; Whaley, C.M. Upcoding Linked to up to Two-Thirds of Growth in Highest-Intensity Hospital Discharges in Five States, 2011–2019. Health Aff. 2024, 43, 1619–1627. [Google Scholar] [CrossRef]
Zafari, B.; Ekin, T.; Ruggeri, F. Multicriteria Decision Frontiers for Prescription Anomaly Detection over Time. J. Appl. Stat. 2021, 49, 3638–3658. [Google Scholar] [CrossRef]
Liu, J.; Bier, E.; Wilson, A.; Honda, T.; Kumar, S.; Gilpin, L.; Guerra-Gomez, J.; Davies, D. Graph Analysis for Detecting Fraud, Waste, and Abuse in Healthcare Data. Proc. AAAI Conf. Artif. Intell. 2015, 29, 3912–3919. [Google Scholar] [CrossRef]
Hong, B.; Lu, P.; Xu, H.; Lu, J.; Lin, K.; Yang, F. Health Insurance Fraud Detection Based on Multi-Channel Heterogeneous Graph Structure Learning. Heliyon 2024, 10, e30045. [Google Scholar] [CrossRef] [PubMed]
Williams, G.; Baxter, R.; He, H.; Hawkins, S.; Gu, L. A Comparative Study of RNN for Outlier Detection in Data Mining. Neurocomputing 2002, 50, 213–226. [Google Scholar] [CrossRef]
Li, J.; Huang, K.; Jin, J.; Shi, J. A Survey on Statistical Methods for Health Care Fraud Detection. Health Care Manag. Sci. 2008, 11, 275–287. [Google Scholar] [CrossRef]
Stowell, N.F.; Schmidt, M.; Wadlinger, N. Healthcare Fraud under the Microscope: Improving Its Prevention. J. Financ. Crime 2018, 25, 1039–1061. [Google Scholar] [CrossRef]
Wibowo, N.M.; Utari, W.; Muhith, A.; Widiastuti, Y. Detection of Healthcare Fraud in the National Health Insurance Program Based on Cost Control. In Proceedings of the International Conference on Tourism, Economics, Accounting, Management, and Social Science (TEAMS 19), Malang, Indonesia, 15–16 August 2019; Atlantis Press: Paris, France, 2019; pp. 263–267. [Google Scholar] [CrossRef]
Dean, P.C.; Vazquez-Gonzalez, J.; Fricker, L. Causes and Challenges of Healthcare Fraud in the US. Int. J. Bus. Soc. Sci. 2013, 4, 1–8. [Google Scholar]
Aruleba, I.T.; Sun, Y. Effective credit risk prediction using ensemble classifiers with model explanation. IEEE Access 2024, 12, 94429–94441. [Google Scholar] [CrossRef]
Lavanya, S.; Manoj Kumar, S.; Mohan Kumar, P. Machine Learning Based Approaches for Healthcare Fraud Detection: A Comparative Analysis. Ann. Rom. Soc. Cell Biol. 2021, 25, 8644–8654. [Google Scholar]
Van Capelleveen, G.; Poel, M.; Mueller, R.M.; Thornton, D.; van Hillegersberg, J. Outlier Detection in Healthcare Fraud: A Case Study in the Medicaid Dental Domain. Int. J. Account. Inf. Syst. 2016, 21, 18–31. [Google Scholar] [CrossRef]
Association of Certified Fraud Examiners (ACFE). The Future of Healthcare Fraud: Artificial Intelligence. Available online: https://www.acfe.com/acfe-insights-blog/blog-detail?s=future-of-healthcare-fraud-artificial-intelligence (accessed on 20 May 2025).
Peng, C.-Y.J.; Lee, K.L.; Ingersoll, G.M. An Introduction to Logistic Regression Analysis and Reporting. J. Educ. Res. 2002, 96, 3–14. [Google Scholar] [CrossRef]
Sperandei, S. Understanding Logistic Regression Analysis. Biochem. Med. 2014, 24, 12–18. [Google Scholar] [CrossRef]
Itoo, F.; Mittal, M.; Singh, S. Comparison and Analysis of Logistic Regression, Naïve Bayes and KNN Machine Learning Algorithms for Credit Card Fraud Detection. Int. J. Inf. Technol. 2020, 13, 1503–1512. [Google Scholar] [CrossRef]
Ahsan, M.; Susanto, T.Y.; Virania, T.A.; Jaya, A.I. Credit Card Fraud Detection Using Linear Discriminant Analysis (LDA), Random Forest, and Binary Logistic Regression. Barekeng 2022, 16, 1337–1346. [Google Scholar] [CrossRef]
Li, Y.; Zhang, S.; Zhang, Q.; Zhao, J.; Chen, Y. A Survey of Computer-Aided Diagnosis for Colorectal Cancer Based on Medical Images. Ann. Transl. Med. 2019, 7, 583. [Google Scholar] [CrossRef]
Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN Model-Based Approach in Classification. In On the Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE; Meersman, R., Tari, Z., Schmidt, D.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 986–996. [Google Scholar] [CrossRef]
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Biau, G. Analysis of a Random Forests Model. J. Mach. Learn. Res. 2012, 13, 1063–1095. [Google Scholar]
Al-Hashedi, K.G.; Magalingam, P. Financial fraud detection applying data mining techniques: A comprehensive review from 2009 to 2019. Comput. Sci. Rev. 2021, 40, 100402. [Google Scholar] [CrossRef]
Burges, C.J.C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
Global Healthcare Accreditation (GHA). The Impact of Artificial Intelligence on Healthcare. Available online: http://www.globalhha.com/doclib/data/upload/doc_con/6213287cd73c5.pdf (accessed on 25 April 2025).
Zhang, Y. Support Vector Machine Classification Algorithm and Its Application. In Information Computing and Applications; Liu, C., Wang, J., Yang, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 179–186. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
Ogunleye, A.; Wang, Q.-G. XGBoost Model for Chronic Kidney Disease Diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020, 17, 2131–2140. [Google Scholar] [CrossRef]
Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for Big Data: An Evaluation on NYCTaxi and MNIST Datasets. In Proceedings of the 2020 IEEE 22nd International Conference on High Performance Computing and Communications, Yanuca Island, Fiji, 14–16 December 2020; pp. 763–770. [Google Scholar] [CrossRef]
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Du, H.; Lv, L.; Guo, A.; Wang, H. AutoEncoder and LightGBM for Credit Card Fraud Detection Problems. Symmetry 2023, 15, 870. [Google Scholar] [CrossRef]
Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A Comparative Analysis of Gradient Boosting Algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967. [Google Scholar] [CrossRef]
Hancock, J.T.; Khoshgoftaar, T.M. Gradient Boosted Decision Tree Algorithms for Medicare Fraud Detection. SN Comput. Sci. 2021, 2, 268. [Google Scholar] [CrossRef]
Rosenblatt, F. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychol. Rev. 1958, 65, 386–408. [Google Scholar] [CrossRef] [PubMed]
Alsamdi, M.; Omar, K.; Noah, S. Back Propagation Algorithm: The Best Algorithm Among the Multi-layer Perceptron Algorithm. Int. J. Comput. Sci. Netw. Secur. 2009, 378–383. [Google Scholar]
Nanduri, J.; Liu, Y.-W.; Yang, K.; Jia, Y. E-commerce Fraud Detection Through Fraud Islands and Multi-layer Machine Learning Model. In Advances in Information and Communication; Arai, K., Kapoor, S., Bhatia, R., Eds.; Springer: Cham, Switzerland, 2020; pp. 556–570. [Google Scholar] [CrossRef]
Pellegrini, G.; Tibo, A.; Frasconi, P.; Passerini, A.; Jaeger, M. Learning Aggregation Functions. arXiv 2020, arXiv:2012.08482. [Google Scholar]
Rashid, A.; Mahfooz, S.; Ullah, F.; Alaboudi, A.; Masood, I.; Ahmad, S.; Mehmood, A.; Khalaf, O.I. Federated Learning for Healthcare Systems: A Comprehensive Survey of Techniques, Applications, and Challenges. Healthcare 2024, 12, 2587. [Google Scholar] [CrossRef]
du Preez, A.; Bhattacharya, S.; Beling, P.; Bowen, E. Fraud detection in healthcare claims using machine learning: A systematic review. Artif. Intell. Med. 2025, 160, 103061. [Google Scholar] [CrossRef]
de Amorim, L.B.V.; Cavalcanti, G.D.C.; Cruz, R.M.O. The Choice of Scaling Technique Matters for Classification Performance. Appl. Soft Comput. 2023, 133, 109924. [Google Scholar] [CrossRef]
Yeo, I.-K.; Johnson, R.A. A new family of power transformations to improve normality or symmetry. Biometrika 2000, 87, 954–959. [Google Scholar] [CrossRef]
Scikit-Learn Developers. QuantileTransformer—Transform Features Using Quantiles; Map to Uniform/Normal. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html (accessed on 15 April 2025).
Brown, G.; Pocock, A.; Zhao, M.-J.; Luján, M. A unifying framework for information-theoretic feature selection. J. Mach. Learn. Res. 2012, 13, 27–66. [Google Scholar]
Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef]
Kursa, M.B.; Rudnicki, W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010, 36, 1–13. [Google Scholar] [CrossRef]
Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A 2016, 374, 20150202. [Google Scholar] [CrossRef] [PubMed]
Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar] [CrossRef]
Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2014, arXiv:1312.6114. [Google Scholar]
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IJCNN), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
Imbalanced-Learn Developers. SMOTEENN—Combine SMOTE Oversampling with Edited Nearest Neighbors Cleaning. Available online: https://imbalanced-learn.org/stable/references/generated/imblearn.combine.SMOTEENN.html (accessed on 17 April 2025).
Powers, D.M.W. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation. Int. J. Mach. Learn. Technol. 2011, 2, 37–63. Available online: https://arxiv.org/abs/2010.16061 (accessed on 18 April 2025).
Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
Wasserbacher, H.; Spindler, M. Machine Learning for Financial Forecasting, Planning and Analysis: Recent Developments and Pitfalls. Digit. Financ. 2022, 4, 63–88. [Google Scholar] [CrossRef]
Ali, A.; Abd Razak, S.; Othman, S.H.; Eisa, T.A.E.; Al-Dhaqm, A.; Nasser, M.; Elhassan, T.; Elshafie, H.; Saif, A. Financial Fraud Detection Based on Machine Learning: A Systematic Literature Review. Appl. Sci. 2022, 12, 9637. [Google Scholar] [CrossRef]
Shen, J.; Bu, F.; Ye, Z.; Zhang, M.; Ma, Q.; Yan, J.; Huang, T. Management of Drug Supply Chain Information Based on “Artificial Intelligence + Vendor Managed Inventory” in China: Perspective Based on a Case Study. Front. Pharmacol. 2024, 15, 1373642. [Google Scholar] [CrossRef]
Gupta, R.A. Healthcare Provider Fraud Detection Analysis. Available online: https://www.kaggle.com/datasets/rohitrox/healthcare-provider-fraud-detection-analysis (accessed on 29 September 2024).
Suri, S.; Jose, D.V. Effective Fraud Detection in Healthcare Domain Using Popular Classification Modeling Techniques. Int. J. Innov. Technol. Explor. Eng. 2019, 8, 579–583. [Google Scholar] [CrossRef]

Figure 1. Boxplot Analysis of Fraudulent Claims.

Figure 2. Distribution of Claim Duration for Inpatients and Outpatients.

Figure 3. Data Integration Process for Healthcare Fraud Detection.

Figure 4. Workflow of Data Processing and Model Development.

Figure 5. Radar chart comparing machine learning algorithms across key evaluation metrics with Smote-ENN.

Figure 6. Confusion matrix of the CatBoost model with Smote-ENN.

Figure 7. Receiver Operating Characteristic (ROC) curve of CatBoost with AUC (Smote-ENN).

Table 1. Dataset Feature Description.

Feature	Description
BeneID	Unique identifier assigned to each beneficiary (patient) in the dataset.
DOB	Beneficiary’s date of birth.
DOD	Beneficiary’s date of death.
Race	Beneficiary’s race.
RenalDiseaseIndicator	Indicator of whether the beneficiary has been diagnosed with renal disease.
State	The state where the beneficiary resides.
County	The county where the beneficiary resides.
NoOfMonths_PartACov	The number of months the beneficiary was covered by Medicare Part A (hospital insurance) during the year.
NoOfMonths_PartBCov	The number of months the beneficiary was covered by Medicare Part B (medical insurance) during the year.
ChronicCond	Indicators of whether the beneficiary has been diagnosed with various chronic conditions, such as heart failure, cancer, Alzheimer’s, etc.
IPAnnlReimbursementAmt	The total amount reimbursed by the insurance company for inpatient services provided to the beneficiary during the year.
IPAnnDeductibleAmt	The total amount paid by the beneficiary out-of-pocket for inpatient services before insurance coverage.
OPAnnReimbursementAmt	The total amount reimbursed by the insurance company for outpatient services provided to the beneficiary during the year.
OPAnnualDeductibleAmt	The total amount paid by the beneficiary out-of-pocket for outpatient services before insurance coverage.
ClaimID	Unique identifier for each insurance claim is filed by a beneficiary.
ClaimStartDt	The date when the healthcare service or treatment covered by the claim began.
ClaimEndDt	The date when the healthcare service or treatment covered by the claim ended.
Provider	Unique identifier of the healthcare provider who submitted the claim.
InscClaimAmtReimbursed	The amount of money reimbursed by the insurance company to the healthcare provider for the services rendered.
AttendingPhysician	Unique identifier for the primary physician responsible for overseeing the patient’s care during their hospital stay.
OperatingPhysician	Unique identifier by the physician who conducted any surgical procedures during the patient’s hospital stay.
OtherPhysician	Unique identifiers for any additional physicians involved in the patient’s care during their hospital stay.
AdmissionDt	The date the patient was admitted to the hospital.
ClmAdmitDiagnosisCode	The primary diagnosis code is assigned to the patient upon admission to the hospital.
DischargeDt	The date the patient was discharged from the hospital.
DiagnosisGroupCode	A broader categorization of the patient’s diagnosis, grouping similar conditions.
ClmDiagnosisCode	(10 codes): Diagnosis codes that provide details about the patient’s medical conditions.
ClmProcedureCode	(6 codes): Procedure Codes that describe the specific medical procedures or services provided during the patient’s outpatient visit.
DeductibleAmtPaid	The amount of money paid by the patient out-of-pocket before the insurance coverage.

Table 2. Comparative Evaluation of Scalers.

Scaler	F1	PR-AUC	ROC-AUC	Precision	Recall
Power	0.529	0.664	0.931	0.386	0.842
Quantile	0.526	0.666	0.928	0.383	0.840
Robust	0.267	0.633	0.903	0.155	0.966

Table 3. Evaluation of Feature Selection Techniques.

Selector	F1	PR-AUC	ROC-AUC	Precision	Recall
Boruta	0.520	0.677	0.935	0.369	0.881
KBest MI	0.514	0.671	0.934	0.364	0.874
L1 Logistic	0.513	0.674	0.935	0.364	0.874

Table 4. Representation methods on Boruta technique.

Representation	F1	PR-AUC	ROC-AUC	Precision	Recall
DAE	0.510	0.676	0.934	0.361	0.873
VAE	0.500	0.670	0.933	0.349	0.883
PCA	0.495	0.668	0.932	0.343	0.895

Table 5. Comparison of Imbalance Mitigation Methods. Bold values denote the highest performance for each metric.

Imbalance Strategy	F1	PR-AUC	ROC-AUC	Precision	Recall
SMOTE-ENN	0.517	0.675	0.933	0.367	0.877
None	0.514	0.674	0.933	0.364	0.877
SMOTE	0.512	0.674	0.933	0.362	0.875
ADASYN	0.451	0.670	0.931	0.300	0.911

Table 6. Parameter Grid of Algorithms.

Algorithm	Parameter Grid
Logistic Regression	Regularization C: 0.01, 0.03, 0.10, 0.30, 1, 3, 10, 30 \| Penalty: l2, l1 \| Solver: liblinear, saga \| Class weight: balanced \| Max iterations: 2000
Support Vector Machine (RBF)	C: 0.1, 0.3, 1, 3, 10, 30, 100 \| Gamma: 0.001, 0.003, 0.01, 0.03, 0.10 \| Kernel: rbf \| Class weight: balanced \| Probability: True
Random Forest	n_estimators: 300, 500, 800 \| max_depth: None, 6, 10, 16, 24 \| min_samples_leaf: 1, 2, 4 \| max_features: sqrt, log2, 0.5 \| Class weight: balanced
XGBoost	n_estimators: 300, 500, 800 \| max_depth: 3, 5, 7, 9 \| learning_rate: 0.01, 0.03, 0.05, 0.10, 0.20 \| subsample: 0.70, 0.85, 1.00 \| colsample_bytree: 0.60, 0.80, 1.00 \| gamma: 0.0, 0.5, 1.0 \| eval_metric: logloss \| tree_method: hist
LightGBM	n_estimators: 300, 600, 900 \| num_leaves: 31, 63, 95, 127 \| max_depth: −1, 10, 6, 16 \| learning_rate: 0.01, 0.03, 0.05, 0.10 \| subsample: 0.70, 0.85, 1.00 \| colsample_bytree: 0.60, 0.80, 1.00 \| min_child_samples: 10, 20, 40 \| Class weight: balanced \| verbosity: −1
CatBoost	iterations: 400, 700, 1000 \| depth: 4, 6, 8, 10 \| learning_rate: 0.01, 0.03, 0.05, 0.10 \| l2_leaf_reg: 1, 3, 5, 7, 9 \| loss_function: Logloss \| verbose: False
K-Nearest Neighbors	n_neighbors: 3, 5, 7, 9, 11, 13, 15 \| weights: uniform, distance \| Minkowski p: 1, 2 \| (Scaling: StandardScaler)
Neural Network (sklearn MLP)	hidden_layer_sizes: (64,), (128,), (64,32), (128,64) \| alpha: 1 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻² \| learning_rate_init: 1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³ \| activation: relu, tanh \| max_iter: 300

Table 7. Performance comparison of machine learning models without SMOTE-ENN across composite CPI and standard evaluation metrics.

Model	CPI	F1	ROC-AUC	Precision	Recall	Specificity	Accuracy	MCC	G-Mean	Type I Error	Type II Error
Neural Network (MLP)	0.644	0.607	0.930	0.587	0.634	0.953	0.923	0.567	0.776	0.046	0.365
Logistic Regression	0.641	0.605	0.931	0.594	0.622	0.955	0.924	0.565	0.770	0.044	0.377
CatBoost	0.639	0.608	0.933	0.642	0.581	0.966	0.930	0.572	0.748	0.033	0.418
Random Forest	0.632	0.598	0.930	0.606	0.594	0.958	0.924	0.558	0.754	0.041	0.405
XGBoost	0.629	0.597	0.930	0.621	0.579	0.963	0.927	0.558	0.745	0.036	0.420
SVM (RBF)	0.624	0.581	0.929	0.546	0.628	0.945	0.915	0.538	0.769	0.054	0.371
LightGBM	0.619	0.587	0.926	0.615	0.563	0.963	0.925	0.547	0.736	0.036	0.436
KNN	0.591	0.396	0.899	0.253	0.911	0.722	0.739	0.390	0.810	0.277	0.089

Table 8. Performance comparison of machine learning models with SMOTE-ENN across composite CPI and standard evaluation metrics.

Model	CPI	F1	ROC-AUC	Precision	Recall	MCC	G-Mean	Specificity	Accuracy	Type I Error	Type II Error
CatBoost	0.633	0.603	0.932	0.651	0.563	0.568	0.738	0.969	0.931	0.031	0.437
SVM (RBF)	0.633	0.591	0.929	0.561	0.634	0.550	0.774	0.948	0.919	0.052	0.366
XGBoost	0.631	0.599	0.929	0.615	0.587	0.560	0.751	0.962	0.926	0.039	0.413
Neural Network (MLP)	0.631	0.588	0.933	0.563	0.629	0.548	0.770	0.948	0.919	0.052	0.371
Logistic Regression	0.631	0.592	0.932	0.569	0.623	0.550	0.768	0.951	0.920	0.049	0.377
Random Forest	0.621	0.582	0.929	0.582	0.595	0.542	0.751	0.955	0.921	0.045	0.405
LightGBM	0.612	0.576	0.923	0.625	0.546	0.541	0.722	0.966	0.926	0.034	0.454
KNN	0.592	0.393	0.900	0.251	0.915	0.388	0.810	0.717	0.736	0.283	0.085

Table 9. Comparative evaluation of machine learning algorithms with and without SMOTE-ENN across composite CPI and conventional performance metrics.

Model	ΔCPI	ΔF1	ΔPrecision	ΔRecall	ΔMCC	ΔROC-AUC	ΔG-Mean	ΔSpecificity	ΔAccuracy	ΔType I Error	ΔType II Error
Neural Network (MLP)	+0.013	+0.019	+0.025	+0.006	+0.020	−0.002	+0.006	+0.006	+0.005	−0.006	−0.006
Random Forest	+0.011	+0.016	+0.025	−0.000	+0.016	+0.002	+0.004	+0.004	+0.004	−0.004	+0.000
Logistic Regression	+0.010	+0.014	+0.026	−0.000	+0.016	−0.001	+0.002	+0.005	+0.005	−0.005	+0.000
LightGBM	+0.007	+0.011	−0.010	+0.017	+0.007	+0.003	+0.014	−0.003	−0.000	+0.003	−0.017
CatBoost	+0.006	+0.006	−0.008	+0.018	+0.004	+0.001	+0.011	−0.003	−0.001	+0.003	−0.018
KNN	−0.000	+0.003	+0.002	−0.004	+0.002	−0.001	+0.001	+0.005	+0.004	−0.005	+0.004
XGBoost	−0.001	−0.002	+0.006	−0.008	−0.001	+0.001	−0.005	+0.001	+0.001	−0.002	+0.008
SVM (RBF)	−0.009	−0.009	−0.014	−0.006	−0.011	+0.001	−0.004	−0.003	−0.003	+0.003	+0.006

Notation: Δ = (No Sampler − SMOTE-ENN). “+” indicates higher without resampling; “−” indicates higher with SMOTE-ENN.

Table 10. Comparative Research Analysis of Fourkiotis & Tsadiras versus Suri [75] (2019).

Metric	Fourkiotis & Tsadiras	Suri
Logistic Regression
Accuracy	0.924	0.910
ROC AUC	0.931	0.800
F1-Score	0.605	0.591
Random Forest
Accuracy	0.924	0.8700
ROC AUC	0.930	0.8400
F1-Score	0.598	0.5400

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Fourkiotis, K.P.; Tsadiras, A. Future Internet Applications in Healthcare: Big Data-Driven Fraud Detection with Machine Learning. Future Internet 2025, 17, 460. https://doi.org/10.3390/fi17100460

AMA Style

Fourkiotis KP, Tsadiras A. Future Internet Applications in Healthcare: Big Data-Driven Fraud Detection with Machine Learning. Future Internet. 2025; 17(10):460. https://doi.org/10.3390/fi17100460

Chicago/Turabian Style

Fourkiotis, Konstantinos P., and Athanasios Tsadiras. 2025. "Future Internet Applications in Healthcare: Big Data-Driven Fraud Detection with Machine Learning" Future Internet 17, no. 10: 460. https://doi.org/10.3390/fi17100460

APA Style

Fourkiotis, K. P., & Tsadiras, A. (2025). Future Internet Applications in Healthcare: Big Data-Driven Fraud Detection with Machine Learning. Future Internet, 17(10), 460. https://doi.org/10.3390/fi17100460

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Future Internet Applications in Healthcare: Big Data-Driven Fraud Detection with Machine Learning

Abstract

1. Introduction

2. Literature Review

3. Dataset and Preprocessing

3.1. Dataset Description

3.2. Data Analysis

3.3. Data Integration and Preprocessing

3.4. Scaling

3.5. Feature Selection

3.6. Representation Learning

3.7. Class Imbalance

4. Modeling Algorithms

5. Evaluation

6. Discussion

7. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI